A MySQL operator error caused an outage that even "high availability" could not survive
2022-06-24 03:45:00 【InfoQ】


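The incident
- Environment: test environment
- Time: 10:30 in the morning
- Reported by: the testers' group chat blew up; after an initial investigation, the developers suspected a database problem.
System deployment diagram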


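Cause of the error and the fix
- ① My first thought: isn't Keepalived there to guarantee high availability? Even if MySQL goes down, Keepalived should restart it automatically. And even if one instance will not come back up, there is still the other one to fall back on, right?
- ② So let's check the state of the MySQL containers. On both MySQL servers I ran docker ps and found that neither MySQL container was in the list, which means the containers were not running.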

- ③ That can't be right. I installed Keepalived for high availability. Did Keepalived go down as well?
- ④ I quickly checked Keepalived and found it running normally on both servers, using the command: systemctl status keepalived

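- ⑤ What? Keepalived is fine too. Keepalived restarts MySQL every few seconds, so maybe I looked during a short gap and missed the MySQL container starting? I tried another command, docker ps -a, which lists containers in every state. It showed that MySQL had started and then exited, so MySQL really was being restarted.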

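- ⑥ So Keepalived was restarting the MySQL container, but MySQL itself was broken, and in that case Keepalived's high availability cannot help.
- ⑦ What now? The only option is to see what error MySQL reports, by running the container log command: docker logs <container id>. The most recent log entries: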

- ⑧ The log says the mysql-bin.index file does not exist. This file is configured in the master-slave replication section of my.cnf.
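The check in step ⑤ (a container that starts and then exits again) can be sketched as a small filter over the docker ps -a listing. This is only an illustration: the function name exited_containers and the container names in the usage note are hypothetical, and it assumes one "name status" pair per line, which is what the --format template in the usage note produces.

```shell
#!/bin/sh
# Print the name of every container whose status starts with "Exited".
# Input: lines of the form "<name> <status text>" on stdin.
exited_containers() {
    while read -r name status; do
        case $status in
            Exited*) printf '%s\n' "$name" ;;
        esac
    done
}
```

A possible invocation on the MySQL host would be docker ps -a --format '{{.Names}} {{.Status}}' | exited_containers. A container that appears here again and again while Keepalived keeps restarting it points to a problem inside MySQL itself, as in step ⑥.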

mysql-bin.xxx, mysql-bin.index: binlog files
mysql-bin.index contents: /var/lib/mysql/log/mysql-bin.000001
mysql-bin.000001; mkdir log; chmod 777 log -R

Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find first log file name in binary log index file'
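- Slave_IO_Running: NO. The replication I/O thread is not running. This thread belongs to the replica: it requests the binlog from the master and writes what it receives to the local relay log. If it is not running, replication on the replica is not working.
- Master_Log_File: mysql-bin.000014. The log file currently being synchronized is 000014. Earlier we saw that mysql-bin.index on node node56 lists 000001, so 000014 is not in the index file at all, which is why the error is raised.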
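The mismatch behind error 1236 can be checked by hand: look for the file the replica asks for in the master's binlog index. Below is a minimal sketch of that lookup. The function name binlog_in_index is hypothetical, and the index path used in the usage note is an assumption based on the directory of the entry shown earlier.

```shell
#!/bin/sh
# Return 0 if the binlog file name $2 is listed in the index file $1,
# 1 if it is not listed, 2 if the index file cannot be read.
binlog_in_index() {
    index_file=$1
    wanted=$2
    [ -r "$index_file" ] || return 2
    # Each index line holds one binlog path; compare only the file name part.
    while IFS= read -r entry || [ -n "$entry" ]; do
        [ "${entry##*/}" = "$wanted" ] && return 0
    done < "$index_file"
    return 1
}
```

For example, binlog_in_index /var/lib/mysql/log/mysql-bin.index mysql-bin.000014 would return 1 when the index only lists mysql-bin.000001, which matches the situation described above.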


FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;
UNLOCK TABLES;
# Stop replication on the replica
STOP SLAVE;
# Set the binlog file and position to replicate from
CHANGE MASTER TO MASTER_HOST='10.2.1.55', MASTER_PORT=3306, MASTER_USER='vagrant', MASTER_PASSWORD='vagrant', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=117748;
# Start replication
START SLAVE;
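After restarting replication it is worth confirming that both replication threads are running. The sketch below reads the text of SHOW SLAVE STATUS\G from stdin and succeeds only when Slave_IO_Running and Slave_SQL_Running are both Yes. The function name replication_healthy is hypothetical, and how the status text is obtained (for instance through the mysql client) is left as an assumption.

```shell
#!/bin/sh
# Exit 0 only if both replication threads report "Yes".
# Input: the "Key: value" lines printed by SHOW SLAVE STATUS\G.
replication_healthy() {
    awk -F': *' '
        { sub(/^[ \t]+/, "", $1) }            # drop the left padding of the key
        $1 == "Slave_IO_Running"  { io = $2 }
        $1 == "Slave_SQL_Running" { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}
```

A possible use, assuming client credentials are already configured: mysql -e 'SHOW SLAVE STATUS\G' | replication_healthy && echo "replication OK".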

Why did the problem occur?

Improvements







