Case Study: Controller Reboots on a NetApp FAS System Caused by a CIFS Bug
2022-07-24 02:14:00 【存储服务专家】
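The customer runs an IBM N3240, which is the OEM equivalent of the NetApp FAS2240, and its controller rebooted frequently. When a reboot completed cleanly, the customer barely noticed. Sometimes, however, the reboot failed and one controller stayed down, and the impact on the business was obvious. The customer's workaround was to boot the controller again by running boot_ontap at the LOADER prompt, after which the system came up normally. This cycle repeated until the end user finally lost patience and demanded that the maintenance provider find the root cause. What follows is the analysis of that case.
The system event log shows that this controller rebooted almost every day, sometimes several times a day: more than 40 times over roughly half a year in 2022. Some of these reboots were performed manually during troubleshooting, and the rest were triggered automatically by the system. Part of the event log follows.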
Record 3180: Sun Jan 16 05:33:45 2022 [SP.critical]: Filer Reboots
Record 3195: Mon Jan 17 15:21:30 2022 [SP.critical]: Filer Reboots
Record 3226: Sat Feb 5 05:54:55 2022 [SP.critical]: Filer Reboots
Record 3241: Mon Feb 7 15:07:24 2022 [SP.critical]: Filer Reboots
Record 3280: Sun Mar 6 03:43:36 2022 [SP.critical]: Filer Reboots
Record 3294: Tue Mar 8 01:52:48 2022 [SP.critical]: Filer Reboots
Record 3307: Tue Mar 8 05:42:47 2022 [SP.critical]: Filer Reboots
Record 3328: Mon Mar 14 15:43:53 2022 [SP.critical]: Filer Reboots
Record 3349: Wed Mar 23 07:33:02 2022 [SP.critical]: Filer Reboots
Record 3362: Wed Mar 23 10:08:18 2022 [SP.critical]: Filer Reboots
Record 3377: Fri Mar 25 05:57:46 2022 [SP.critical]: Filer Reboots
Record 3393: Mon Mar 28 05:42:58 2022 [SP.critical]: Filer Reboots
Record 3411: Sat Apr 2 02:12:31 2022 [SP.critical]: Filer Reboots
Record 3429: Tue Apr 5 15:05:40 2022 [SP.critical]: Filer Reboots
Record 3449: Wed Apr 13 00:53:43 2022 [SP.critical]: Filer Reboots
Record 3476: Wed Apr 27 13:09:58 2022 [SP.critical]: Filer Reboots
Record 3493: Sun May 1 13:18:01 2022 [SP.critical]: Filer Reboots
Record 3524: Thu May 19 01:49:50 2022 [SP.critical]: Filer Reboots
Record 3539: Sat May 21 06:40:10 2022 [SP.critical]: Filer Reboots
Record 3553: Sun May 22 16:17:47 2022 [SP.critical]: Filer Reboots
Record 3568: Tue May 24 13:24:54 2022 [SP.critical]: Filer Reboots
Record 3598: Fri Jun 10 13:26:04 2022 [SP.critical]: Filer Reboots
Record 3615: Wed Jun 15 00:14:00 2022 [SP.critical]: Filer Reboots
Record 3629: Thu Jun 16 11:33:34 2022 [SP.critical]: Filer Reboots
Record 3644: Fri Jun 17 05:47:57 2022 [SP.critical]: Filer Reboots
Record 3657: Fri Jun 17 12:12:42 2022 [SP.critical]: Filer Reboots
Record 3676: Fri Jun 24 00:05:47 2022 [SP.critical]: Filer Reboots
Record 3690: Sat Jun 25 16:26:57 2022 [SP.critical]: Filer Reboots
Record 3705: Mon Jun 27 05:35:27 2022 [SP.critical]: Filer Reboots
Record 3720: Wed Jun 29 10:53:57 2022 [SP.critical]: Filer Reboots
Record 3736: Sat Jul 2 12:43:12 2022 [SP.critical]: Filer Reboots
Record 3750: Mon Jul 4 03:23:30 2022 [SP.critical]: Filer Reboots
Record 3766: Thu Jul 7 10:30:59 2022 [SP.critical]: Filer Reboots
Record 3779: Thu Jul 7 12:00:53 2022 [SP.critical]: Filer Reboots
Record 3794: Fri Jul 8 07:10:19 2022 [SP.critical]: Filer Reboots
Record 3807: Sat Jul 9 01:27:50 2022 [SP.critical]: Filer Reboots
Record 3822: Sun Jul 10 11:46:48 2022 [SP.critical]: Filer Reboots
Record 3836: Tue Jul 12 04:32:41 2022 [SP.critical]: Filer Reboots
Record 3850: Wed Jul 13 22:39:10 2022 [SP.critical]: Filer Reboots
Record 3864: Thu Jul 14 01:28:26 2022 [SP.critical]: Filer Reboots
Record 3877: Thu Jul 14 07:43:41 2022 [SP.critical]: Filer Reboots
Record 3892: Sat Jul 16 12:42:43 2022 [SP.critical]: Filer Reboots
Record 3906: Mon Jul 18 03:35:09 2022 [SP.critical]: Filer Reboots
Record 3919: Mon Jul 18 04:23:21 2022 [SP.critical]: Filer Reboots
Record 3933: Tue Jul 19 04:24:35 2022 [SP.critical]: Filer Reboots
Record 3946: Tue Jul 19 10:15:06 2022 [SP.critical]: Filer Reboots
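To see how the reboots are distributed over time, the event log excerpt above can be tallied with a short script. The following is a minimal Python sketch, not a NetApp tool: the function names `parse_reboots` and `reboots_per_month` are hypothetical, and the regular expression assumes that every record has exactly the layout shown in the excerpt.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed record layout, taken from the excerpt above:
# Record 3180: Sun Jan 16 05:33:45 2022 [SP.critical]: Filer Reboots
REBOOT_RE = re.compile(
    r"^Record\s+(?P<id>\d+):\s+"
    r"(?P<ts>\w{3}\s+\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4})"
    r"\s+\[SP\.critical\]:\s+Filer Reboots\s*$"
)


def parse_reboots(lines):
    """Return one datetime per 'Filer Reboots' record; other lines are skipped."""
    reboots = []
    for line in lines:
        match = REBOOT_RE.match(line.strip())
        if match:
            # Collapse repeated spaces so single-digit days parse the same way.
            ts = " ".join(match.group("ts").split())
            # %a and %b expect English names, which is the default C locale.
            reboots.append(datetime.strptime(ts, "%a %b %d %H:%M:%S %Y"))
    return reboots


def reboots_per_month(reboots):
    """Count reboots per calendar month, keyed as 'YYYY-MM'."""
    return Counter(dt.strftime("%Y-%m") for dt in reboots)


if __name__ == "__main__":
    sample = [
        "Record 3180: Sun Jan 16 05:33:45 2022 [SP.critical]: Filer Reboots",
        "Record 3195: Mon Jan 17 15:21:30 2022 [SP.critical]: Filer Reboots",
        "Record 3226: Sat Feb 5 05:54:55 2022 [SP.critical]: Filer Reboots",
    ]
    for month, count in sorted(reboots_per_month(parse_reboots(sample)).items()):
        print(month, count)
```

Feeding the full log into `parse_reboots` instead of the three sample lines gives a per-month count that makes it easy to show the end user whether the reboot frequency is rising.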
The logs captured at reboot time almost always contain the same kind of panic information, similar to the following:

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 06
fault virtual address = 0x28
fault code = supervisor read data, page not present
instruction pointer = 0x8:0xffffffff842d9ef3
stack pointer = 0x10:0xfffffe0008764bd0
frame pointer = 0x10:0xfffffe0008764bf8
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 1553 (ontap: cpu2)
trap number = 12
PANIC : page fault (supervisor read, page not present) on VA 0x28 cs:rip = 0x8:0xffffffff842d9ef3 rflags = 0x10246
version: 8.1.2: Tue Oct 30 19:56:51 PDT 2012
conf : x86_64
cpuid = 2
Uptime: 44m28s
PANIC: page fault (supervisor read, page not present) on VA 0x28 cs:rip = 0x8:0xffffffff842d9ef3 rflags = 0x10246 in SK process Auth09 on release 8.1.2 on Mon Jul 18 12:20:08 CST 2022
version: 8.1.2: Tue Oct 30 19:56:51 PDT 2012
compile flags: x86_64
HA: current time (in sk_msecs) 2648596 (in sk_cycles) 4702800163010
DUMPCORE: START
Dumping to disks: 0a.00.11
Writing panic info to sparecore disk.
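Because the same panic recurs across reboots, it can help to reduce each PANIC line to a small signature (fault address, instruction pointer, process, release) and compare signatures between incidents. The following is a minimal Python sketch under the assumption that PANIC lines look like the two shown above; `PANIC_RE` and `panic_signature` are hypothetical names, not part of any NetApp tooling.

```python
import re

# Assumed layout, based on the PANIC lines in the log above. The
# "in SK process ... on release ..." suffix is optional because only
# the second PANIC line in the log carries it.
PANIC_RE = re.compile(
    r"PANIC\s*:\s*(?P<reason>.+?) on VA (?P<va>0x[0-9a-fA-F]+)"
    r" cs:rip = (?P<rip>0x[0-9a-fA-F]+:0x[0-9a-fA-F]+)"
    r" rflags = (?P<rflags>0x[0-9a-fA-F]+)"
    r"(?: in SK process (?P<process>\S+) on release (?P<release>\S+))?"
)


def panic_signature(line):
    """Return the panic fields as a dict, or None if the line is not a PANIC line."""
    match = PANIC_RE.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    line = (
        "PANIC: page fault (supervisor read, page not present) on VA 0x28 "
        "cs:rip = 0x8:0xffffffff842d9ef3 rflags = 0x10246 in SK process Auth09 "
        "on release 8.1.2 on Mon Jul 18 12:20:08 CST 2022"
    )
    print(panic_signature(line))
```

If every incident yields the same `va`, `rip`, and `process` values, the reboots very likely share a single cause, which is what makes matching against a known bug report practical.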
The root cause of the panic above is bug 552397 in the CIFS (SMB 2.0) implementation of NetApp ONTAP 8.1.
NetApp's official description of the bug is as follows:
If durable handles are enabled and all the following conditions are met, controller disruption might occur: 1. Multiple SMB 2 sessions from a workstation share the same TCP connection; and 2. There are open files on at least one of those sessions; and 3. Network error detected by the workstation triggers session reconnects.

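With the cause of the panic identified, the next step is the fix. The permanent solution is to upgrade the ONTAP operating system to 8.1.4P9 or later. If upgrading is not an option but the daily panics still have to stop, the workaround is to disable the durable handles feature, or to disable SMB 2.0 and 2.1 altogether.
For the detailed solution, contact the author on WeChat: StorageExpert.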
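When several controllers have to be checked against the fixed release, the comparison can be scripted. The following is a minimal Python sketch, assuming version strings of the form `X.Y.Z` with an optional `P<n>` patch suffix (as in `8.1.2` and `8.1.4P9`); `parse_ontap_version` and `needs_upgrade` are hypothetical helpers, and other suffixes that real releases may carry are deliberately rejected rather than guessed at.

```python
import re

# Assumed format: major.minor[.maintenance][P<patch>], e.g. "8.1.2" or "8.1.4P9".
VERSION_RE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?(?:P(\d+))?$")


def parse_ontap_version(text):
    """Turn a version string into a comparable tuple; missing parts count as 0."""
    match = VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported version string: {text!r}")
    return tuple(int(part) if part is not None else 0 for part in match.groups())


# Minimum fixed release named in this article.
FIXED_IN = parse_ontap_version("8.1.4P9")


def needs_upgrade(running):
    """True if the running release sorts before 8.1.4P9."""
    return parse_ontap_version(running) < FIXED_IN


if __name__ == "__main__":
    for version in ("8.1.2", "8.1.4", "8.1.4P9"):
        print(version, needs_upgrade(version))
```

The tuple comparison is only meaningful within the 8.1 family discussed here. Whether a release from another family contains the fix is not something this sketch can decide and should be confirmed against NetApp's bug report.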