Station B (Bilibili) went down. Let's hear from one of the injured programmers
2022-06-24 06:33:00 【Programmer fish skin】
No gossip here: an analysis of the Station B outage, plus some prevention techniques
Hello everyone, I'm Fish Skin. I'm sure many of you have heard that Xiaopo station (the "little broken site", a nickname for Station B) went down yesterday.
In the past, since I'm not one for gossip, I wouldn't have paid any attention to this kind of thing. If it's down, it's down. When the sky falls, the programmers hold it up, and it will be fixed soon enough.
But this time is different, because I became one of the "victims" of this incident myself!
So today, from a programmer's perspective, let's go over the whole story of the Station B outage and its possible causes, and then share some prevention techniques and takeaways.
The whole story
When Station B had just started to go down, but not yet completely, I was in my live room writing a bit of code and chatting with viewers. I don't watch the bullet comments much while I code, so I didn't notice that they had stopped moving and that nobody was posting.
At first I thought my coding was simply boring and nobody was paying attention to me. So I sat there muttering to myself: strange, why is nobody sending bullet comments with me? Hey, anyone there? Hello? Hi?
Only later did I notice that the comment area wasn't even showing room-entry notices. Nobody entering for several minutes is impossible. Something must have happened!
I assumed the bullet comments were broken, so I turned them off and on again, with the same result. Then I tried restarting the live stream, and found that once it was closed it could not be reopened. The screen said: it seems the connection to the server has been lost.
To be honest, before this I never imagined that Station B, a platform with traffic in the hundreds of millions, could go down. So my first reaction was the same as everyone else's: I suspected my own network. But other web pages opened fine, so my network wasn't the problem. Then a frightening thought hit me: what the...? Did Station B ban me? (I'm an old wanted man, after all.)
That's how it was. I was a victim at the scene of the accident, the one knocked flat on the ground, so it wasn't until more than ten minutes later that I learned through other channels: oh, it was Station B itself that had a problem.
Although I missed the very first moments, the trending searches gave a rough picture of how the crash unfolded. In short, for a few hours, users could not properly use any feature of Station B!
Opening Station B first returned 404 Not Found (resource not found):
And then 502 Bad Gateway:
About an hour later, some users reported that part of Station B's functionality was available again, but it had not fully recovered. It was not until the early morning of the 14th that Station B officially responded and said that service was back to normal.
Guessing the cause
Last night I was editing video until after 2 a.m. I meant to go straight to bed, but I couldn't resist taking another look, and found that "Station B collapsed" was the No. 1 trending question. Out of curiosity I wanted to know the real reason behind the accident and see what everyone thought.
I'm an outsider who doesn't work at Station B and has no deep knowledge of its technical architecture. Key information was missing, and there was no reliable evidence for any conjecture, so I wasn't planning to give an opinion. But it turned out that few of the top answers came from programmers inferring the cause from a technical angle. Most were short answers that just made the gossip tastier. So I figured I might as well draw on the architecture knowledge I had learned and do a round of speculation. If I get it right, that's a pleasant surprise.
In fact, back in 2020, Mr. Mao Jian, Station B's technical director, gave a talk on the Tencent Cloud+ Community titled 《High-Availability Architecture Practice at Station B》. I watched the whole thing at the time, but I never expected that one day the highly available Station B would be unavailable.
So before doing this analysis, I reread the article version of that talk. Interestingly, in just half a day its view count had gone up by 150,000!
Even more interesting, a lot of "mockery" has appeared at the bottom of the article, such as "eight-part essay architect" (someone who only recites textbook answers):
But I don't think that's called for, because the techniques Mr. Mao Jian shared really are practical high-availability solutions. They just lacked real-world confirmation.
Article address :https://cloud.tencent.com/developer/article/1618923
Here are my guesses.
Guess 1: the gateway went down
First, when Xiaopo station had its accident, other sites went down too! Station A (AcFun), Jinjiang, and Douban all made the trending list.
Accidents happening at the same time suggest that something went wrong with a shared service these systems depend on, and about the only thing capable of causing such a large-scale disruption is the CDN.
A CDN is a content delivery network. It pushes the origin site's content in advance to server nodes in each region, so users in different regions can fetch content from a nearby node instead of all going to the origin. This speeds up content delivery and spreads the load.
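To make the "fetch from a nearby node, fall back to the origin" idea concrete, here is a minimal Python sketch. It is only an illustration under assumptions, not how any real CDN or Station B's provider works: `EdgeNode`, `origin`, the region name, and the paths are all hypothetical.

```python
from typing import Callable, Dict


class EdgeNode:
    """A hypothetical CDN edge node that caches content from the origin site."""

    def __init__(self, region: str, fetch_from_origin: Callable[[str], bytes]):
        self.region = region
        self._fetch_from_origin = fetch_from_origin
        self._cache: Dict[str, bytes] = {}
        self.origin_hits = 0  # requests that had to go back to the origin

    def prefetch(self, path: str) -> None:
        """Push content to the edge ahead of time (the 'in advance' step)."""
        self._cache[path] = self._fetch_from_origin(path)

    def get(self, path: str) -> bytes:
        """Serve from the local cache; go back to the origin only on a miss."""
        if path not in self._cache:
            self.origin_hits += 1
            self._cache[path] = self._fetch_from_origin(path)
        return self._cache[path]


def origin(path: str) -> bytes:
    # Stand-in for the real origin site (assumption, for illustration only).
    return f"content of {path}".encode()


edge = EdgeNode("region-1", origin)
edge.prefetch("/video/1.mp4")
edge.get("/video/1.mp4")  # served from the edge, origin untouched
edge.get("/video/2.mp4")  # cache miss: one trip back to the origin
edge.get("/video/2.mp4")  # now cached at the edge
```

If every edge node disappears at once, each of those `get` calls effectively becomes an origin request, which is the traffic shift the next paragraph describes.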
Once the CDN goes down, all the traffic from users in the affected regions lands on the gateway:
A gateway is like the boss of a family: users tell the boss what they need, and the boss hands the work to the brothers (the services) to carry out.
Besides that, the gateway usually also takes on the job of protecting the services behind it, handling load balancing, flow control, circuit breaking, degradation and so on in one place.
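As a rough illustration of two of those duties, the sketch below combines round-robin load balancing with a token-bucket rate limiter. It is a toy under stated assumptions, not Station B's gateway: `Gateway`, `TokenBucket`, `make_backend`, and the service names are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
import itertools
from typing import Callable, List


class TokenBucket:
    """Flow control: bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class Gateway:
    """Single entry point: rejects excess traffic, then routes round-robin."""

    def __init__(self, backends: List[Callable[[str], str]], limiter: TokenBucket):
        self._backends = itertools.cycle(backends)
        self._limiter = limiter

    def handle(self, request: str, now: float) -> str:
        if not self._limiter.allow(now):
            return "429 Too Many Requests"  # shed load instead of passing it on
        backend = next(self._backends)
        return backend(request)


def make_backend(name: str) -> Callable[[str], str]:
    # Hypothetical downstream service that just echoes what it handled.
    return lambda request: f"{name} handled {request}"


gateway = Gateway(
    [make_backend("svc-a"), make_backend("svc-b")],
    TokenBucket(rate=1.0, capacity=2),
)
# Three requests arrive at the same instant; the bucket only holds two tokens.
responses = [gateway.handle(f"req-{i}", now=0.0) for i in range(3)]
```

Here the third simultaneous request is rejected at the door rather than forwarded, which is the kind of self-protection the following paragraphs ask about.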
In principle, a gateway not only protects downstream services, it also needs protection for itself. So why didn't the gateway protect itself?
My guess: before the gateway had time to turn on its protective measures (its own circuit breaking, degradation, and so on), it was killed by the traffic.
Once the gateway is down, the services lose their parent. Without a call entry point they are naturally unavailable, even though the services behind the gateway are not all paralyzed themselves.
Guess 2: service avalanche
Another guess is that Station B's system contains many service call chains. Because the CDN or some machines went down, the execution time of a downstream service A increased. That in turn increased the execution time of service B, the upstream service that calls A, so the system could process less work per unit of time. With upstream requests piling up continuously, the whole call chain eventually avalanches, and every service on the chain, from child to parent, goes down.
A down-to-earth example: the toilet at home is blocked. The bowl isn't full yet, but "deliveries" keep arriving from above. In the end no more "deliveries" are possible, and the toilet overflows!
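The backlog effect can be shown with a few lines of arithmetic. The sketch below is a deliberately simplified model with made-up numbers (100 requests arriving per tick, capacity dropping from 120 to 60); it is not a measurement of any real system, and `simulate_backlog` is a hypothetical helper.

```python
from typing import List


def simulate_backlog(arrivals_per_tick: int, capacity_per_tick: int, ticks: int) -> List[int]:
    """Return the number of queued requests after each tick for one service."""
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick                 # new requests keep arriving
        backlog -= min(backlog, capacity_per_tick)   # the service drains what it can
        history.append(backlog)
    return history


# Healthy: capacity exceeds arrivals, so nothing queues up.
healthy = simulate_backlog(arrivals_per_tick=100, capacity_per_tick=120, ticks=5)

# Degraded: each call got slower, so capacity per tick fell below the arrival rate.
degraded = simulate_backlog(arrivals_per_tick=100, capacity_per_tick=60, ticks=5)

print(healthy)   # [0, 0, 0, 0, 0]
print(degraded)  # [40, 80, 120, 160, 200]
```

The queue in the degraded case grows without bound. In a real call chain, every queued request also holds threads and connections in the caller, which is how the slowdown spreads upstream.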
Official explanation
Later, the official explanation was that a server room had failed. Having read other people's analyses, I think the official explanation holds up.
Indeed, when Station B shared its high-availability architecture before, it hardly mentioned disaster recovery or multi-site active-active design. It focused more on the local service layer and application layer: rate limiting, degradation, circuit breaking, retries, timeout handling, and so on. So when designing large-scale distributed systems we should think more comprehensively. Take this as a warning~
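Of the service-layer techniques listed above, circuit breaking is perhaps the least obvious, so here is a minimal sketch of the idea. This is a generic textbook-style breaker written for illustration, not the one used at Station B: `CircuitBreaker`, `CircuitOpenError`, the threshold, and the reset window are all assumptions.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the downstream service."""


class CircuitBreaker:
    """Open after `failure_threshold` consecutive failures; retry after `reset_after` seconds."""

    def __init__(self, failure_threshold: int, reset_after: float):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # time at which the breaker opened, if open

    def call(self, func, now: float):
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Half-open: let one trial call through; a single failure reopens.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now
            raise
        self.failures = 0  # success closes the breaker fully
        return result


def flaky():
    # Hypothetical downstream call that always times out.
    raise TimeoutError("downstream timed out")


breaker = CircuitBreaker(failure_threshold=2, reset_after=30.0)
outcomes = []
for t in (0.0, 1.0, 2.0):
    try:
        breaker.call(flaky, now=t)
    except CircuitOpenError:
        outcomes.append("fast-fail")
    except TimeoutError:
        outcomes.append("timeout")
```

After two timeouts the third call fails immediately without touching the downstream service, which gives that service room to recover instead of adding to its backlog.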
As of writing, the author of the top-ranked answer on Zhihu had carefully pieced together the clues:
Why did the other two sites recover quickly, while Station B took several hours to return to normal?
It probably has something to do with Station B's in-house components. On one hand, the cloud provider's failure took the downstream service chain down, so the affected area was large. On the other hand, restarting takes time, and during the restart the upstream load balancer may not be able to withstand the traffic peak. So getting back to normal means at least waiting until many container replicas have fully restarted.
And at around 23:00 yesterday, when I opened Station B, what I saw was old data from a few hours earlier. This suggests that by then some service replicas had restarted and a degraded mode had kicked in, without querying real data.
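Serving old data while the backend is unhealthy is one common form of degradation. The sketch below shows the general pattern only. Whether Station B did anything like this is the author's guess above, and `DegradingReader`, `query_feed`, and the `state` flag are hypothetical names.

```python
from typing import Callable, Dict, Tuple


class DegradingReader:
    """Return fresh data when possible; fall back to the last good copy on failure."""

    def __init__(self, query_backend: Callable[[str], str]):
        self._query_backend = query_backend
        self._last_good: Dict[str, str] = {}

    def read(self, key: str) -> Tuple[str, bool]:
        """Return (value, is_stale)."""
        try:
            value = self._query_backend(key)
        except Exception:
            if key in self._last_good:
                return self._last_good[key], True  # degraded: serve stale data
            raise  # nothing cached, so the failure is surfaced to the caller
        self._last_good[key] = value
        return value, False


state = {"up": True}  # toggles the simulated backend on and off


def query_feed(key: str) -> str:
    # Stand-in for a real data query (assumption, for illustration only).
    if not state["up"]:
        raise ConnectionError("backend unavailable")
    return f"fresh {key}"


reader = DegradingReader(query_feed)
fresh = reader.read("homepage")   # backend healthy
state["up"] = False
stale = reader.read("homepage")   # backend down: last good copy is returned
```

The trade-off is that users see outdated content, as described above, but the page still loads and the recovering backend is spared the query.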
I didn't expect my answer to get fairly popular: for the first time I reached Top 2 on a question with ten million views. I'm flattered, truly flattered...
Self-preservation disclaimer: the above is only my guesswork, and my expertise is limited. Discussion is welcome in the comments section. Please go easy on me.
Prevention techniques
Next, let's talk about techniques for preventing service failures: how to keep services highly available and keep serving users without downtime as far as possible.
I sorted the techniques I've learned into a simple classification, as a mind map:
That's all I can think of for now. Of course, there are other techniques.
Time is limited, so I won't go through these techniques here. On how to reduce bugs and keep services highly available, see my earlier article, "The sword of Damocles in software development", which explains many of the techniques above.
Takeaways
As one of the victims of this accident, I also came away with some gains and insights, rather than just gossiping for nothing.
The first is to have a questioning spirit. When we run into trouble while writing a program, habitually looking for the cause in ourselves first is fine. But if our own investigation finds no bug, we should be bold enough to suspect the library, the component, the service we depend on, or even the editor, rather than assuming that anything famous must be right. When Xiaopo station broke, I suspected that my live stream had been banned, haha, and I almost went to beg the admins on my knees.
In programming, we can't just memorize knowledge, repeat what others say, and become an "eight-part essay architect". We should be experienced engineers: don't believe blindly, don't take things for granted, accumulate experience in practice, and optimize systems based on real conditions.
Through this analysis, combined with the actual course of the failure, I also reviewed the architecture knowledge I had learned and gained a deeper understanding of some high-availability designs. Going forward, I'll try not to let Programming Navigation (www.code-nav.cn) become the next Station B (doge).
As mentioned above, we should also stay prepared for danger in times of peace and build a habit of defensive programming, instead of scrambling to fix things after they break. For a famous platform like Station B, even a small problem causes immeasurable losses to its users and to the company.
Thanks to Daddy B for the one-day premium membership as compensation.
Finally, here are some more of the learning materials that helped me land offers from big tech companies:
I'm off, leaving 6 TB of resources behind!
How I taught myself from scratch and got offers from Tencent, ByteDance and other big companies. Read this article and stop feeling lost!
I studied computer science for four years. Let's encourage each other!
I'm Fish Skin. I'll still ask for a like. I wish you all the best: may you get rich and have good luck.