Disaster recovery series (V) -- database disaster recovery construction

2022-06-24 02:58:00 【Kaiyuan】

In an age when data is king, data security is the lifeblood of an enterprise, so ensuring it is particularly important. This article looks at database disaster recovery and, based on current customer business and the available technologies and products, discusses how to build the most suitable disaster recovery scheme. It covers the following three aspects:

Elements of scheme design
Cloud disaster recovery scheme
Cloud customer cases

1. Elements of scheme design

The main design elements of a database disaster recovery scheme are data synchronization, data consistency, and data recovery.

1.1 Data synchronization

Data synchronization refers mainly to synchronizing data between two availability zones or between different regions. It is divided into one-way synchronization and two-way synchronization.

Synchronization mode	Replication principle	Scenario	Advantages	Disadvantages
One-way synchronization	Master-slave replication	Upper-layer business writes in single-write mode	1. Few data consistency challenges 2. Minor business changes	Strong dependence on business latency
Two-way synchronization	Master-master bidirectional replication	Upper-layer business writes in single-write or dual-write mode	Weak dependence on business latency	1. Significant data consistency challenges 2. Major business changes

A typical two-way synchronization scenario is as follows:

All traffic for a single game service is carried in one region. As shown in the figure below, the services in the red box normally carry no traffic and serve as an emergency escape route for the business. The database is read and written within the same region, with two-way synchronization between regions.

Single-write database, two-way synchronization scenario

1.2 Data consistency

Data consistency means that when the upper-layer business reads the database, the data stored on the master and slave nodes of the database cluster should be the same. If the data differs across database clusters, the business will read dirty (stale) data. For example, suppose a bank's balance database is inconsistent. One day Xiao Wang, an office worker, gets paid and receives an SMS saying his balance is 30,000 yuan, but when he logs in to the bank app he sees a balance of only 2,000 yuan. The newly added balance has not been synchronized to all master and slave databases in time, so the data is inconsistent. Imagine how many customer service staff the bank would need to handle this.

1. Single-write business scenario

In the single-write scenario the business has only one database cluster, so the consistency guarantee depends on the master-slave replication mode within that cluster: asynchronous, semi-synchronous, or strong synchronous. Most businesses settle for eventual consistency, and semi-synchronous replication is the most popular choice.

Replication mode	Technical principle	Consistency	Performance
Asynchronous replication	The application initiates an update request. The master node responds to the application as soon as it completes the operation, then replicates the data to the slave node asynchronously.	Weak	Strong
Semi-synchronous replication	The application initiates an update request. After performing the update, the master node immediately replicates the data to the slave node. The slave node receives the data, writes it to the relay log (without needing to execute it), and returns a success message to the master node. The master node responds to the application only after receiving that success message.	Medium	Medium
Strong synchronous replication	The application initiates an update request. After completing the operation, the master node replicates the data to the slave node. The slave node returns a success message to the master node after receiving the data, and the master node responds to the application only after receiving that feedback. Replication from master to slave is synchronous.	Strong	Weak
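The three replication modes in the table can be contrasted with a small in-memory simulation. This is only a sketch under stated assumptions: the Master and Slave classes are hypothetical, and the behavior when the slave is unreachable (semi-synchronous degrades to asynchronous, as the MySQL semi-sync plugin does on timeout, while strong synchronization refuses to reply) is an assumption that the table itself does not spell out.

```python
from enum import Enum


class Mode(Enum):
    ASYNC = "async"
    SEMI_SYNC = "semi-sync"
    STRONG_SYNC = "strong-sync"


class Slave:
    """Hypothetical slave node that stores received changes in a relay log."""

    def __init__(self):
        self.reachable = True
        self.relay_log = []

    def receive(self, change):
        """Append the change to the relay log; the return value is the ACK."""
        if not self.reachable:
            return False
        self.relay_log.append(change)
        return True


class Master:
    """Hypothetical master node; write() returns the reply sent to the application."""

    def __init__(self, mode, slave):
        self.mode = mode
        self.slave = slave
        self.committed = []
        self.pending = []  # committed locally but not yet on the slave

    def write(self, change):
        if self.mode is Mode.ASYNC:
            # Reply first; replication happens later in the background.
            self.committed.append(change)
            self.pending.append(change)
            return "ok: replied before replication"
        if self.slave.receive(change):
            self.committed.append(change)
            return "ok: replied after slave ACK"
        if self.mode is Mode.SEMI_SYNC:
            # Assumption: fall back to asynchronous behavior when no ACK arrives.
            self.committed.append(change)
            self.pending.append(change)
            return "ok: degraded to async"
        # Assumption: strong synchronization never replies without an ACK.
        return "blocked: waiting for slave ACK"
```

The sketch makes the trade-off in the last two columns visible: the modes that wait for the slave are the ones that stall or degrade when the slave cannot be reached.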

2. Dual-write business scenario

In the dual-write scenario the business system has two database clusters. Data consistency then depends both on the replication mode within each cluster and on the replication between the two clusters, and it is guaranteed mainly at the business layer. The figure below shows the mainstream dual-write database scheme in the industry:

1) Users are assigned to different IDC data centers according to their user information, and the API gateway forwards each user's requests to the corresponding IDC cluster.

2) The MySQL data is partitioned into units. Both entry points are writable, but a given user's data can be accessed through only one entry point, which keeps reads consistent.

3) If data conflicts occur, the system overwrites the older data according to timestamp order.

The mainstream dual-write database scheme in the industry
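The routing rule and the timestamp-based conflict rule from the three points above can be sketched as follows. The IDC names, the hash-based assignment, and the Version record are hypothetical choices made for illustration; the article only says that users are split by user information and that older data is overwritten by timestamp.

```python
import hashlib
from dataclasses import dataclass

IDCS = ["idc-a", "idc-b"]  # hypothetical data center names


def home_idc(user_id: str) -> str:
    """Map a user to exactly one IDC so that the user's data has a single entry point."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return IDCS[digest[0] % len(IDCS)]


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: float  # write time, for example seconds since the epoch
    origin: str       # IDC that accepted the write, used only as a tie-breaker


def resolve(local: Version, remote: Version) -> Version:
    """Last-write-wins: keep the newer version; break timestamp ties by origin name."""
    return max(local, remote, key=lambda v: (v.timestamp, v.origin))
```

A gateway would call home_idc on every request, and the replication layer would call resolve when both clusters hold a version of the same row. Note that overwriting by timestamp is only as reliable as the clock synchronization between the two sites.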

1.3 Data recovery

This section covers how to ensure data consistency after a database cluster fails.

1) Failure detection and switchover

Failures are usually detected by an arbitration center. When it finds during a detection cycle that the master node is abnormal, it performs a VIP switch. MHA is a mature high-availability failover solution for databases in the industry. Tencent Cloud uses ZooKeeper (ZK) to detect failures and trigger the switch; in testing, a master/standby switchover completes in about 30 s.
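The detect-then-switch flow can be illustrated with a toy arbiter. This is a hedged sketch, not how MHA or the ZK-based mechanism is implemented: the Arbiter class, the heartbeat timeout, and the rule of promoting the live slave with the highest replicated position are all assumptions made for the example.

```python
class Arbiter:
    """Hypothetical arbitration center that tracks heartbeats and owns the VIP mapping."""

    def __init__(self, master, slaves, timeout=10.0):
        self.timeout = timeout  # seconds without a heartbeat before a node is considered down
        self.vip_owner = master
        nodes = [master, *slaves]
        self.last_seen = {node: 0.0 for node in nodes}
        self.position = {node: 0 for node in nodes}

    def heartbeat(self, node, now, position):
        """Record that a node is alive and how far it has replicated."""
        self.last_seen[node] = now
        self.position[node] = position

    def alive(self, node, now):
        return now - self.last_seen[node] <= self.timeout

    def check(self, now):
        """Move the VIP to the most up-to-date live slave if the current owner is down."""
        if self.alive(self.vip_owner, now):
            return self.vip_owner
        candidates = [
            node for node in self.last_seen
            if node != self.vip_owner and self.alive(node, now)
        ]
        if not candidates:
            return self.vip_owner  # nothing healthy to switch to
        self.vip_owner = max(candidates, key=lambda node: self.position[node])
        return self.vip_owner
```

Choosing the candidate with the highest replicated position keeps data loss to a minimum, which connects the switchover step to the consistency concerns discussed next.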

2) Consistency guarantees after a failure

Building on the consistency mechanisms described in section 1.2, reinforcement depends on the actual business scenario. For businesses that are not very sensitive to data, a master-slave switch does not require a consistency comparison. For businesses that are sensitive to data consistency, there is usually an internal full-data verification tool. If inconsistencies are found, they are either repaired automatically by overwriting according to timestamps under established rules, or analyzed and handled manually using local logs, to keep the data safe.
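A full verification pass followed by timestamp-based repair might look like the sketch below. The in-memory dictionaries standing in for tables, the updated_at column name, and the helper functions are hypothetical; a real verification tool would read rows from both databases in chunks and would also need a rule for deleted rows, which this sketch ignores.

```python
import hashlib


def row_checksum(row: dict) -> str:
    """Checksum of a row that does not depend on column order."""
    payload = repr(sorted(row.items())).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def diff_tables(master: dict, slave: dict) -> list:
    """Return the primary keys whose rows differ or exist on one side only."""
    keys = master.keys() | slave.keys()
    return sorted(
        key for key in keys
        if (key in master) != (key in slave)
        or row_checksum(master[key]) != row_checksum(slave[key])
    )


def repair(master: dict, slave: dict, ts_field: str = "updated_at") -> list:
    """Overwrite each differing row with the version that has the newest timestamp."""
    fixed = []
    for key in diff_tables(master, slave):
        candidates = [table[key] for table in (master, slave) if key in table]
        winner = max(candidates, key=lambda row: row[ts_field])
        master[key] = dict(winner)
        slave[key] = dict(winner)
        fixed.append(key)
    return fixed
```

The list returned by diff_tables can also be handed to an operator instead of repair when manual analysis against local logs is preferred.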

2. Platform disaster recovery scheme

The Tencent Cloud data products most commonly used in customer business scenarios are Redis, CDB, MongoDB, and TDSQL.

Data product	Cross-AZ disaster recovery	Nearby access	Cross-region disaster recovery
CDB	Supported; self-service configuration in the console	Supported; cross-AZ / cross-region RO (read-only) instances	Option 1: supported through DTS; the business must switch the VIP manually. Option 2: DTS dual-write capability, between cloud and on-premises or across multiple sites.
Redis	Supported; self-service configuration in the console	Supported; cross-AZ / cross-region replicas	Option 1: supported through DTS replication. Option 2: DTS global replication capability, with nearby reads from multiple sites.
TDSQL	Supported; self-service configuration in the console	Supported; automatic read/write splitting across AZs / regions	Supported through DCN replication
MongoDB	Supported; self-service configuration in the console	Supported; cross-AZ replicas	Supported through DTS replication

3. Cloud customer cases

A financial company on the cloud currently uses the cloud TDSQL product, and its database stores order data. Its current single-availability-zone deployment needs to be upgraded to a multi-availability-zone deployment. The upgrade introduces the following risk factors:

1. Business latency increases by roughly 3 ms of network delay; TDSQL has no proximity routing from proxy to DB.
2. In extreme cases, the probability of master-slave consistency problems increases.
3. Network jitter across availability zones can cause write requests to hang.

Different AZs in the same region have about 3 ms of network delay between them; for the sake of disaster recovery, it is recommended to accept this performance trade-off. For concerns 2 and 3, disaster recovery recommendations are given below based on the TDSQL product.

The TDSQL core database currently uses a single-availability-zone architecture with one master and two slaves, and the replication mode is strong synchronization. There are three main candidate schemes; after weighing them, Scheme 3 is adopted:

For a description of TDSQL strong synchronization, see https://cloud.tencent.com/document/product/557/10570

Scheme	Details	Advantages	Disadvantages
Scheme 1	Dual-AZ deployment: one master and one slave in one availability zone, one slave in the other	1. Business latency: cross-AZ delay has little effect, and latency is almost the same as in a single AZ; under strong synchronization, most ACKs are returned to the master by slave1. 2. Write hangs: the same as in the single-AZ scenario.	1. Data consistency: poor. Strong synchronization relies on slave1, so slave2's data may not be up to date, and data may be inconsistent across the availability zones. 2. Dirty reads: if the network between AZ1 and AZ2 fails, then while ZK is deciding to evict slave2 (20 s), read-only traffic may read stale data from slave2. (For latency-sensitive business, reading from slave nodes is not recommended either.)
Scheme 2	Dual-AZ deployment: the master in one availability zone, two slaves in the other	1. Data consistency: if the master's AZ fails, the strong synchronization rule ensures eventual consistency of the data. 2. Dirty reads: in theory the two slaves return their ACKs at nearly the same time, so when the cross-AZ network fails the probability of reading dirty data from either slave is very low.	1. Business latency: crossing AZs adds about 3 ms of network delay; the business should evaluate this against its specific transactions. 2. Write hangs: there is only one logical link across availability zones, so writes depend on the stability of that link and are more likely to hang.
Scheme 3	Three-AZ deployment: three availability zones, one node per zone	1. Data consistency: if the master's AZ fails, the strong synchronization rule ensures eventual consistency of the data. 2. Dirty reads: in theory the two slaves return their ACKs at nearly the same time, so when the cross-AZ network fails the probability of reading dirty data from either slave is very low. 3. Write hangs: there are two logical links across availability zones, which improves cross-AZ network resilience and lowers the probability that writes hang.	Crossing AZs adds about 3 ms of network delay; the business should evaluate this against its specific transactions.
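The latency and write-hang columns of the table can be checked with a small model. It is a sketch under simplifying assumptions: a write is acknowledged as soon as one slave returns an ACK, the cross-AZ delay is the 3 ms quoted above, the same-AZ delay of 0.5 ms is a made-up placeholder, and AZ names and the SCHEMES layout are hypothetical.

```python
CROSS_AZ_MS = 3.0  # cross-AZ network delay quoted in the article
SAME_AZ_MS = 0.5   # hypothetical same-AZ delay, for illustration only

# Placement of the master and the two slaves in each scheme.
SCHEMES = {
    "scheme 1": {"master": "az1", "slaves": ["az1", "az2"]},
    "scheme 2": {"master": "az1", "slaves": ["az2", "az2"]},
    "scheme 3": {"master": "az1", "slaves": ["az2", "az3"]},
}


def ack_wait_ms(scheme, down_links=frozenset()):
    """Time until the first slave ACK, or None if no slave is reachable (the write hangs).

    down_links is a collection of frozenset({az_a, az_b}) pairs whose link is broken.
    """
    master = scheme["master"]
    waits = []
    for az in scheme["slaves"]:
        if az == master:
            waits.append(SAME_AZ_MS)
        elif frozenset((master, az)) not in down_links:
            waits.append(CROSS_AZ_MS)
    return min(waits) if waits else None
```

With the link between az1 and az2 marked as down, Scheme 2 has no reachable slave and the write hangs, while Scheme 3 can still collect an ACK over its second cross-AZ link. Scheme 1 keeps its low latency because the same-AZ slave answers first, which is also why its cross-AZ slave can fall behind.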
