In practice: graceful shutdown in a large-scale microservice architecture
2022-07-25 07:34:00 【51CTO】
About the author
Sandun SRE, Mobile changes life , Technology affects the future ; My three tricks , my IT.
Problem description
After our large-scale microservice transformation, restarting or scaling out service instances during the day often caused the system to report “Can not get connection to server” errors. (This exception is thrown by the microservice framework and indicates that the client cannot reach the server instance it was assigned.) Some users' business requests failed as a result, which hurt the user experience.
Problem analysis
The microservice framework in our production environment uses ZooKeeper for service registration and discovery, as shown in Figure 1:

Figure 1: Service registration and call flow
The main call logic is:
- A container is created and the service starts inside it;
- The service registers with the service router (ZooKeeper);
- Service callers subscribe to the service router;
- When the registrations on the service router change, it notifies the service callers to fetch the new registration list;
- The service caller makes service calls according to the server list it obtained.
However, when a server instance stops, the server does not proactively update its registration on the service router, so the client needs 40 seconds (the ZooKeeper session timeout currently configured for the application) to evict the stale entry. During those 40 seconds the application keeps trying to reach the instance that no longer exists, which produces a large number of business errors.
This matches what we observed: the error window after every instance restart lasts about the same length of time. Moreover, because of how microservices work, a single business request calls the same service many times as the business requires, which raises the chance that each business request hits the abnormal instance and therefore raises the probability of business failure.
So the problem is caused by the abrupt stop of a service instance and amplified by the many calls that a single business request makes in a microservice architecture.
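The amplification effect can be made concrete with a rough estimate. Assume, purely for illustration, that the service has N registered instances of which one has stopped but is still listed, that each call picks an instance uniformly at random and independently, and that there is no retry (N, k, and these assumptions are not taken from the production system). A business request that calls the service k times then fails with probability:

```latex
P_{\text{fail}} = 1 - \left(1 - \frac{1}{N}\right)^{k}
\qquad\text{e.g. } N = 10,\ k = 5:\quad
P_{\text{fail}} = 1 - 0.9^{5} \approx 0.41
```

Under these assumptions, one stale instance out of ten fails roughly 41% of business requests rather than 10%, and it does so for the whole 40-second window.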
Solution
Since the problem is caused by abruptly stopping service instances, we began to study graceful shutdown for microservices.
Graceful shutdown of an application in a microservice architecture means that the application instance exits in a planned, smooth way (that is, with no unfinished work left behind and no abnormal errors reported). There are two main approaches:
- Approach one: use the detection capability built into the microservice framework. In the Spring Cloud microservice framework, for example, this is done through the /health endpoint provided by the actuator component. The client implements a custom HealthCheckHandler that keeps the application's health state in memory; a shutdown command is then sent on the server with curl, and once the state changes the application re-registers with the registration server.
- Approach two: register a JDK ShutdownHook. When the system receives the exit instruction, it first takes itself offline on the ZooKeeper registration server and stops accepting new messages, then processes the backlog of requests, then calls the resource-recycling interface to destroy resources, and finally each thread exits.
Because our production environment does not use a general-purpose open-source microservice framework, and our applications are developed in Java, we adopted approach two: implementing graceful shutdown by registering a JDK ShutdownHook (shutdown hook).
A shutdown hook is essentially a thread (also called a hook thread) that runs when the JVM shuts down. A shutdown hook is registered with the JVM through the addShutdownHook method of Runtime. Hook threads run only when the JVM shuts down normally; they are not executed on a forced shutdown.
The JVM shuts down normally in the following scenarios, and the hook is called in each of them:
- The Java program finishes running and exits normally;
- The program is terminated with Ctrl-C in the terminal;
- The JVM hits an OutOfMemory error and exits;
- System.exit() is executed in the Java process;
- The operating system shuts down;
- On Linux, the process is ended with kill pid (or kill -15 pid).
The ShutdownHook-related source code in the JDK is shown in Figure 2:

Figure 2: Implementation of adding and removing hooks
How is a ShutdownHook invoked? The java.lang.Runtime.addShutdownHook method registers a JVM shutdown hook (a thread), as shown in Figure 3. Any cleanup work we want the program to do when the JVM exits, such as closing resources, deregistering from the service router, and waiting for in-flight requests to finish, can be implemented in this thread.

Figure 3: Example of registering a JVM shutdown hook
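As a minimal sketch of what such a registration might look like (this is not the code from the figure), the class below registers a hook that runs the cleanup steps in the order described earlier. The class name, method names, and step descriptions are hypothetical placeholders; a real service would replace each step with its own ZooKeeper deregistration and resource cleanup.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: the names and steps are illustrative, not the framework's real API.
class GracefulShutdownSketch {

    // Records each cleanup step so the execution order can be inspected.
    static final List<String> steps = Collections.synchronizedList(new ArrayList<>());

    // Placeholder for one cleanup step; a real service would do the actual work here.
    static void step(String description) {
        steps.add(description);
        System.out.println(description);
    }

    // The cleanup work, in the order described in the article.
    static void shutdownSequence() {
        step("deregister from service router"); // e.g. remove the ZooKeeper registration
        step("stop accepting new requests");
        step("finish in-flight requests");
        step("release resources");
    }

    // Builds the hook thread; kept separate so it can also be run directly.
    static Thread newShutdownHook() {
        return new Thread(GracefulShutdownSketch::shutdownSequence, "graceful-shutdown-hook");
    }

    public static void main(String[] args) {
        // The JVM starts this thread on normal exit, Ctrl-C, or kill -15.
        Runtime.getRuntime().addShutdownHook(newShutdownHook());
        System.out.println("service running");
    }
}
```

Running `main` prints "service running" and then the four steps, because returning from `main` is itself a normal JVM shutdown. Killing the same process with `kill -9` would print none of them.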
Of course, graceful exit needs a timeout control mechanism. If resource recycling and the other pre-exit operations have still not finished when the timeout expires, the shutdown script calls kill -9 PID directly, or the code calls a method that forcibly terminates the process. Otherwise the exit may take a long time and affect our normal start and stop operations.
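One way to bound the waiting time inside the process is sketched below, assuming in-flight requests run on a standard `ExecutorService`. The class name `BoundedDrainSketch`, the method `drain`, and the timeout values are illustrative assumptions, not part of the original implementation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of timeout-controlled draining of in-flight work.
class BoundedDrainSketch {

    // Returns true if all in-flight tasks finished within the timeout.
    static boolean drain(ExecutorService workers, long timeoutMillis) {
        workers.shutdown(); // stop accepting new tasks, let submitted ones run
        try {
            if (workers.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return true;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        workers.shutdownNow(); // timeout or interrupt: cancel whatever is left
        return false;
    }

    public static void main(String[] args) {
        ExecutorService workers = Executors.newFixedThreadPool(2);
        workers.submit(() -> System.out.println("handling in-flight request"));
        boolean clean = drain(workers, 5_000);
        System.out.println(clean ? "drained cleanly" : "timed out, forcing exit");
    }
}
```

If `drain` returns false inside a shutdown hook, the hook could end the process with `Runtime.getRuntime().halt(status)`. `System.exit` is not suitable at that point because it blocks once the shutdown sequence has already started.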

Other precautions
1. In the following scenarios the JVM process is stopped directly, the JVM has no chance to run the remaining work in the shutdown hook thread, and graceful shutdown cannot be achieved:
- kill -9 (the SIGKILL signal);
- The java.lang.Runtime.halt() method is called;
- The host crashes;
- The host is powered off directly;
- Host memory (or container memory) runs out and triggers the operating system's OOM killer.
2. Hook threads delay the JVM's shutdown, so keep their execution time as short as possible and enforce timeout control.
Results
After the code was optimized, verification in the test environment and production trials in fixed scenarios showed that “Can not get connection to server” errors no longer appeared when containers were destroyed or restarted normally, and the impact on the business was resolved. With this problem solved, features such as automatic container self-healing and scaling out and in can be used again in a large-scale microservice architecture.
Summary
There is no single optimal solution for graceful shutdown of microservices; what matters is designing around the core idea. If the framework in use already provides such a solution, use it directly, since it will fit the framework best. In a microservice architecture, the following rules are suggested when designing a graceful shutdown mechanism:
- All microservice applications should support graceful shutdown;
- Deregister the service instance from the registry first;
- Mark the access point of the service being shut down as refusing service;
- Upstream services should support failover for requests rejected because of graceful shutdown;
- Provide an appropriate shutdown interface according to the specific business.
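The upstream failover rule can be illustrated with a small client-side sketch. It assumes a draining instance signals refusal by throwing an exception; `ServiceUnavailableException`, `callWithFailover`, and the modelling of instances as plain functions are all hypothetical simplifications rather than any real framework's API.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of client-side failover when an instance refuses service.
class FailoverSketch {

    // Thrown by an instance that is shutting down and no longer accepts requests.
    static class ServiceUnavailableException extends RuntimeException {
        ServiceUnavailableException(String message) { super(message); }
    }

    // Tries each instance in turn and returns the first successful response.
    static String callWithFailover(List<Function<String, String>> instances, String request) {
        ServiceUnavailableException last = null;
        for (Function<String, String> instance : instances) {
            try {
                return instance.apply(request);
            } catch (ServiceUnavailableException e) {
                last = e; // this instance is draining: try the next one
            }
        }
        throw new IllegalStateException("no instance available", last);
    }

    public static void main(String[] args) {
        Function<String, String> draining = req -> {
            throw new ServiceUnavailableException("shutting down");
        };
        Function<String, String> healthy = req -> "ok:" + req;
        // prints "ok:order-1"
        System.out.println(callWithFailover(List.of(draining, healthy), "order-1"));
    }
}
```

Only the explicit refusal is retried on another instance. Other exceptions propagate to the caller, since retrying a request that may already have been partly processed is a separate design decision.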
Elegant operations are inseparable from early automation and intelligent design. These design ideas will be presented as real cases at the 2020 DAMS China Data Intelligence Management Summit, for your reference:
- 《 Suning large-scale intelligent alarm convergence and alarm root cause practice 》 Suning Technology Group Director of Cloud Computing Soup swimming
- 《 Ping An Bank “Traditional + Internet” integrated CMDB and operations middle-platform practice 》 Ping An Bank Head of operations development Xu Dawei
- 《 China CITIC Bank DevOps practice 》 China CITIC Bank DevOps Implementation lead Lihongtao
- 《 Alibaba's large-scale container cloud infrastructure environment architecture 、 Management and operation and maintenance 》 Alibaba Senior technical expert Yao Jie ( Hello )
