Multi-Active Architecture Design: Designing the Routing Service

2022-06-23 11:58:00 【InfoQ】

One. Background

As the company's business grows, the impact of each stability incident keeps increasing. Providing stable service and keeping the system highly available has become a problem for the whole technology department. Against this background, the company launched multi-cloud / multi-active technical projects, and I had the honor of taking part in designing the remote dual-active transformation of the "next-day delivery" project 【1】. Here I would like to share my understanding of some technical solutions for multi-active deployment and globalization.

I will organize the multi-active architecture series into five parts: the overall technical scheme, dual-active / global regional deployment technology, network scheduling technology, performance optimization, and SRE. This post focuses on the overall technical scheme and on the routing service design within dual-active / global regional deployment technology. Later posts will gradually fill in the complete technical scheme of the multi-active architecture.

Two. Technical requirements of multi-active / globalization

Besides meeting users' basic requirements for performance and availability, a network service in a multi-active / global context also has to satisfy requirements such as compliance and data isolation, and even the basic requirements face new challenges.

1.  Performance

The shorter the time between a user sending a request and receiving the response, the better the performance. In a dual-active / global context, however, the user may be in Japan while the data center is in China. The physical distance grows, and the response time grows with it. Test data shows that for cross-country or long-distance cross-data-center calls, the online RTT increases by about 1s, and this 1s may lower the transaction "conversion rate" or even cost us users.

2.  Availability

A multi-active / global business spans time zones, so our services must be available 7×24 hours. This is a challenge not only for the system but also for staffing.

3.  Interconnection

Interconnection refers to the physical connection between telecommunication networks, which lets users of one telecom operator communicate with users of another. In China, cross-operator network communication is no longer a big problem, but abroad the quality of network interconnection in many countries is still far from ideal.

4.  Data consistency

When data is shared by users worldwide and many users can read and write it, how do we keep it consistent?

5.  Privacy protection

A global business must comply with GDPR (General Data Protection Regulation).

6.  Scalability

Wikipedia's explanation: the ability of a system, network, or process to cope when its workload increases or decreases.

Three. Overall architecture of multi-active / globalization

3.1  Domain modeling

The core of the system is handling the relationships between domain models. Starting from the domain model lets the whole system meet current and future needs and helps the project team cooperate better. The core objects of the system are as follows.

User: a user of the website platform.

UserGroup: a user group. Users with the same characteristics generally use the same network link and data center scheduling strategy, so they belong to the same user group. In practice, the user group is the scheduling unit of the multi-active system. A user is a member of a user group; we can schedule by user group, or schedule a specific user.

EdgeNode: an edge node. It can be understood as the node that serves static resources such as images, js files, and css files. Edge nodes generally refer to the CDN.

NetLink: a network link.

NetNode: a network node.

PoP: a network point of presence (routers, switches, and similar nodes).

DSA: dynamic site acceleration, an acceleration service for dynamic content provided by CDN vendors.

IDC: a data center.

DNS: a DNS server.

HTTP-DNS: can be understood as the DNS server for the app side.

DomainName: a domain name.

VIP: a virtual IP.

The relationships between these domain models are shown in the figure.

3.2  Overall framework

As described above, between the user and the data center there is edge node scheduling, network scheduling, and data center scheduling. Scheduling is further split into execution (use of routes) and control (generation of routes).

A route here means the data center to which each user or user group belongs.

The structure diagram is as follows.

Global users: the system divides global users into user groups, and into four regions by area, namely Continent A, Continent B, Continent C, and Continent D. During scheduling execution, the normal process is to push the scheduling information to the app of each user group.

Edge scheduling: scheduling of static content. It determines which edge node each user group should use.

Network scheduling: every feasible link is measured in real time with big data, and a decision model determines which link to take. (This is unrelated to the routes discussed here.)

Data center: the origin site.

Scheduling execution: PCs are scheduled through DNS; apps are scheduled through HTTP-DNS and PUSH. (Use of routes)

Scheduling control: the specific schedule is computed from real-time data. (Generation and configuration of routes)

Four. Multi-active / global regional deployment technology

4.1  The overall architecture

4.1.1  Function priority

The specific scheduling strategy is determined by business requirements. Generally it considers compliance, data consistency, scalability, cost, capacity, performance, and stability. The usual order of importance is:

Compliance > Data consistency > Scalability > Others

4.1.2  Deployment architecture

We currently have four data centers (Continent A, Continent B, Continent C, Continent D), each serving the users of the region it is located in. Data is replicated between data centers on demand. Every data center deploys all applications and databases, so all data centers are peers. Once data is backed up, any data center can serve as a disaster recovery data center. Data consistency and scalability are covered later.

4.1.3  Problem analysis

Take trading, nearby access, and remote disaster recovery in an e-commerce scenario as examples. If the buyer and the seller come from different regions, a transaction inevitably raises consistency problems for the data they share. Remote disaster recovery also faces data consistency problems. And when a user moves to another region and should still be served nearby, the consistency of that user's own data is involved.

Our strategies are as follows:

Nearby access: a user is routed to a fixed data center (under normal circumstances), and the user's data forms a closed loop within that data center.

Remote disaster recovery: because the applications are peers, we only need to make sure the data exists in the backup data center.

Global buying and selling: data is synchronized on demand, and product information is synchronized to all data centers.

Data consistency: follow the single-master principle, that is, any given piece of data is modified in only one data center, and respect the business priority (buyer > seller > operations).

4.1.4  Solution

From the perspective of application layering, the solution is shown in the figure.

4.2  Routing service

The essence of regional deployment is multi-layer routing: at every layer, calls are routed according to the user's home data center. The role of the routing service is to tell the caller which data center a user belongs to.

Routing service structure

In-memory routing table: can be understood as a HashMap whose key is the user ID and whose value is the user's home data center and user status.

RPC service.

How the routing table is used

The first application a user request reaches in a data center is the unified access layer. Nginx serves as the unified access application; the routing table is embedded in Nginx and shared across multiple processes. After accepting a request, Nginx first obtains the user ID and then queries the routing table for the user's home data center and user status. If the user belongs to this data center, the request is passed on downstream.

Downstream applications that need the routing information take it directly from what the upper layer passes down, as shown in the figure.

The passed-through route has a time limit; once it is exceeded, the passed-through content becomes invalid. What happens if the user's route changes while it is being passed along? This is answered later in the article.
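
The lookup at the access layer can be sketched as follows. This is only an illustration in Python, not the Nginx module itself: the names `Route`, `ROUTING_TABLE`, `LOCAL_DC`, and `handle_request` are hypothetical, and both the handling of unknown users and the forwarding of users homed in another data center are assumptions that the text above does not spell out.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Route:
    home_dc: str    # data center the user belongs to
    writable: bool  # user status: False while writes are prohibited


# In-memory routing table: user ID -> Route (a dict stands in for the HashMap).
ROUTING_TABLE: Dict[int, Route] = {
    1001: Route("dc-a", True),
    1002: Route("dc-b", True),
}

LOCAL_DC = "dc-a"  # hypothetical name of the data center this access layer runs in


def handle_request(user_id: int) -> dict:
    """Decide what the access layer does with a request from user_id."""
    route: Optional[Route] = ROUTING_TABLE.get(user_id)
    if route is None:
        # Assumption: the article does not say how unknown users are handled here.
        return {"action": "unknown_user"}
    if route.home_dc == LOCAL_DC:
        # Pass the route downstream so later layers do not look it up again.
        return {"action": "pass_downstream", "route": route}
    # Assumption: users homed elsewhere are forwarded to their home data center.
    return {"action": "forward", "target_dc": route.home_dc}
```

Attaching the route to the downstream call is what lets lower layers skip their own lookup, which is the pass-through described above.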

4.2.1  Principle of routing table

The routing table design must satisfy the following:

It must be kept in memory.

It must guarantee performance and throughput.

It must not depend on third-party systems.

The routing design should be freely upgradable.

4.2.1.1  Scheme comparison

The candidates compared are introducing a distributed cache, a HashMap, and a Bloom filter. Each has its drawbacks, as follows.

4.2.1.1.1 Introducing a distributed cache

Drawbacks

Every system must call the remote cache, which is a strong dependency.

When a user's home data center changes, both the client cache and the remote cache must be updated.

Every system gains one more strong dependency.

4.2.1.1.2  HashMap

Drawbacks

Storing 50 million users takes about 2 GB of memory.

4.2.1.1.3  Bloom filter

Drawbacks

It has false positives.

So no existing scheme seems to fit, and a routing table customized for this scenario is needed.

4.2.1.2  Routing table design

Inspired by the above, we chose a bit array to store routing information, using 4 bits to represent one user, as shown in the figure.

This way, storing 100 million users takes only about 47 MB of memory.
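
A minimal sketch of such a table, assuming the user ID is used directly as the array position and two users share one byte. The class name `NibbleTable` and the helper `bytes_needed` are hypothetical; the helper reproduces the memory estimate: 100 million users at 4 bits each is 50,000,000 bytes, roughly 47.7 MiB.

```python
class NibbleTable:
    """Fixed-size table that stores one 4-bit value per user ID."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Two users share one byte, so round up to a whole byte.
        self.data = bytearray((capacity + 1) // 2)

    def set(self, user_id: int, value: int) -> None:
        if not 0 <= user_id < self.capacity:
            raise IndexError(user_id)
        if not 0 <= value <= 0xF:
            raise ValueError("value must fit in 4 bits")
        index, odd = divmod(user_id, 2)
        shift = 4 if odd else 0  # odd IDs use the high half of the byte
        cleared = self.data[index] & ~(0xF << shift) & 0xFF
        self.data[index] = cleared | (value << shift)

    def get(self, user_id: int) -> int:
        if not 0 <= user_id < self.capacity:
            raise IndexError(user_id)
        index, odd = divmod(user_id, 2)
        shift = 4 if odd else 0
        return (self.data[index] >> shift) & 0xF


def bytes_needed(users: int) -> int:
    """Memory needed for `users` entries at 4 bits each."""
    return (users + 1) // 2
```

Because the user ID is the position, the ID itself never has to be stored, which is where the saving over a HashMap comes from.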

But suppose the user IDs are distributed in segments:

0~ 80000000

100000000~ 300000000

700000000~ 800000000

2000000000~ 2000100000

Although the real number of users is only about 100 million, the IDs are spread so widely that a single bit array costs more than 900 MB of memory.

For this reason we introduce a segmented mode:

Segmented mode

The segmented mode is shown in the figure. The core idea is to build a segment index table in which each index entry points to a bit sequence that stores user information (for example, one segment per 100 million users). Index entries whose ID range contains no users point to a NULL segment, and no storage is allocated for the corresponding bit sequence, which saves a great deal of memory. The same set of registered users then takes about 58 MB of storage.
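
The segmented idea can be sketched like this, assuming segments are allocated lazily on first write and `None` plays the role of the NULL segment. The class name `SegmentedNibbleTable` and the tiny segment size in the usage below are illustrative only and do not reproduce the memory figures quoted above.

```python
from typing import List, Optional


class SegmentedNibbleTable:
    """Segment index table whose entries point to 4-bit-per-user byte arrays."""

    def __init__(self, segment_size: int, max_user_id: int) -> None:
        self.segment_size = segment_size
        segment_count = max_user_id // segment_size + 1
        # None is the NULL segment: no storage is allocated for it.
        self.segments: List[Optional[bytearray]] = [None] * segment_count

    def set(self, user_id: int, value: int) -> None:
        if not 0 <= value <= 0xF:
            raise ValueError("value must fit in 4 bits")
        seg_no, offset = divmod(user_id, self.segment_size)
        segment = self.segments[seg_no]
        if segment is None:
            # Allocate the segment only when its first user appears.
            segment = bytearray((self.segment_size + 1) // 2)
            self.segments[seg_no] = segment
        index, odd = divmod(offset, 2)
        shift = 4 if odd else 0
        segment[index] = (segment[index] & ~(0xF << shift) & 0xFF) | (value << shift)

    def get(self, user_id: int) -> Optional[int]:
        seg_no, offset = divmod(user_id, self.segment_size)
        segment = self.segments[seg_no]
        if segment is None:
            return None  # no user in this ID range
        index, odd = divmod(offset, 2)
        shift = 4 if odd else 0
        return (segment[index] >> shift) & 0xF

    def allocated_bytes(self) -> int:
        return sum(len(s) for s in self.segments if s is not None)
```

For instance, with `segment_size=1000` and IDs up to 2,000,100, writing users 5 and 2,000,050 allocates only two 500-byte segments out of 2,001 index entries.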

4.2.1.3  Routing table related design

The previous step settled the basic storage scheme of the routing table, but some scenarios need further design. Consider two questions:

When a data center fails and a disaster recovery switch is required, implementing it on the existing routing table means changing the home information of every user in that data center. That may involve tens or hundreds of millions of user changes, which is very costly.

In the Double 11 scenario, user behavior can be planned for with big data, but the event happens only once a year and offers few samples to learn from, so actual behavior easily deviates from expectations. The data center in Continent A may then run short of capacity while the data center in the United States has plenty to spare, and we need to divert some Continent A users to the US data center. How can the current routing table support this?

For these scenarios we propose a concept called the "logical data center".

Under normal conditions, a logical data center maps directly to its original data center.

During a disaster recovery switch, the logical data center is mapped directly to the disaster recovery data center.

When some users need to be diverted, the user ID is hashed and taken modulo, and users with different results are mapped to different physical data centers.

The mapping itself can be managed centrally in whatever configuration system each company uses.
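
The three cases can be sketched with a single lookup function. Everything here is an assumption for illustration: the configuration shape (a list of physical data centers per logical one), the names `resolve`, `NORMAL`, `DISASTER_RECOVERY`, and `DIVERT_QUARTER`, and the use of the user ID itself as the hash value.

```python
from typing import Dict, List


def resolve(logical_dc: str, user_id: int, mapping: Dict[str, List[str]]) -> str:
    """Map a logical data center to a physical one for this user."""
    targets = mapping[logical_dc]
    # The user ID itself serves as the hash here; a real system would
    # likely use a stronger hash before taking the modulo.
    return targets[user_id % len(targets)]


# Normal operation: the logical data center maps to its original data center.
NORMAL = {"A": ["dc-a"]}

# Disaster recovery: remap the logical data center; no per-user change is needed.
DISASTER_RECOVERY = {"A": ["dc-us"]}

# Diversion: one slot in four points to the US, so about a quarter of users move.
DIVERT_QUARTER = {"A": ["dc-a", "dc-a", "dc-a", "dc-us"]}
```

The routing table keeps storing the logical data center per user, so a switch touches one configuration entry instead of millions of table entries.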

4.2.2  Routing table update mechanism

The routing table update mechanism has to satisfy the following design constraints:

Data consistency: while the routing table is changing, a user's home information may be inconsistent across data centers or machine nodes.

Recoverable and able to roll back: whatever state the system is in, it must be able to return to an expected state.

Fast change: both the process of ensuring consistency and recovery or rollback affect the user experience and may even make the system unusable, so the change must complete in a very short time.

4.2.2.1  Data consistency thinking

Distributed systems often have to solve one problem: how to make a modification to a record take effect in all or several data centers at the same time. The solution is not complicated, and it is general. We cannot guarantee that a change takes effect everywhere at the same moment, but we can know whether it has taken effect in each data center. On that basis we add an intermediate state that is compatible with both the state before the change and the state after it, and the problem is solved.

As shown in the figure, state A is the state before the change and state C is the target state after it. A and C must never coexist. We first change state A to state B, wait until every relevant machine in every data center is in state B, and only then move from B to C, so A and C never exist at the same time.

To keep business data globally consistent during a route update, we introduce a "write-prohibited" transition version. Before switching to the target data center, the route is first set to the "write-prohibited" transition version. In this state the user cannot perform any action that modifies the relevant business data, in the current data center or in any other. Before the "write-prohibited" transition version changes to the new version, the local version on every node that resolves routes must have been upgraded to the "write-prohibited" transition version. The transition version strictly separates the periods in which the old and new route versions are in effect, so the two are never in effect at the same moment, which guarantees the global consistency of business data.

(Note: for users in the "write-prohibited" transition version, the availability of other services is affected to some extent while writes are prohibited. The business has to accept this impact and treat it as a partial, temporary degradation of availability. Because the degradation is scheduled for periods when users are inactive, it usually has little effect on the user experience. Back to the routing table: the user ID does not need to be stored. Of the bits that belong to a user, the first three hold the home data center and the fourth is the writable flag: 0 means the user can write, 1 means the user is prohibited from writing.)
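
The 4-bit layout in the note can be illustrated with a pair of helpers. The bit order is an assumption (the three home bits are taken as the high bits and the flag as the lowest bit), and `pack` and `unpack` are hypothetical names.

```python
from typing import Tuple


def pack(home_dc: int, write_prohibited: bool) -> int:
    """Encode a data center number (0..7) and the write flag into 4 bits."""
    if not 0 <= home_dc < 8:
        raise ValueError("home_dc must fit in 3 bits")
    return (home_dc << 1) | int(write_prohibited)


def unpack(entry: int) -> Tuple[int, bool]:
    """Decode a 4-bit entry into (home_dc, write_prohibited)."""
    return entry >> 1, bool(entry & 1)
```

Three bits allow up to eight data centers, and flipping only the lowest bit moves a user into or out of the write-prohibited state without touching the home bits.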

4.2.2.2  Solution

4.2.2.2.1  Separating data preparation from activation

The "write-prohibited" state affects users: a write-prohibited user cannot, for example, place an order. Making changes while users are inactive reduces the chance of affecting them, and shortening the change itself reduces it further. To that end we separate data preparation from activation.

Data preparation writes the users' home information into a distributed persistent database.

Because fast rollback is required, the write must be multi-versioned, which means the data in the persistence-layer database must be multi-versioned. Once the data is ready, the version is activated through ZooKeeper's watch mechanism. The process is as follows:

When a route change is needed, the route change control program writes the data into the database and assigns a version number.

When the data is ready, the version number is written to the watched node in ZooKeeper, and all watchers are notified.

Each machine that needs to load the routing table reads the data from the database and loads the new version of the routing table.
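
The split between preparation and activation can be simulated in memory. This sketch does not use a ZooKeeper client: `VersionedRouteStore` stands in for the persistent database, `VersionNode` for the watched node, and `RoutingMachine` for a machine that loads the table. All three names are hypothetical.

```python
from typing import Callable, Dict, List, Optional


class VersionedRouteStore:
    """Stands in for the persistent database: keeps every version of the data."""

    def __init__(self) -> None:
        self._versions: Dict[int, Dict[int, int]] = {}

    def write(self, version: int, routes: Dict[int, int]) -> None:
        self._versions[version] = dict(routes)

    def read(self, version: int) -> Dict[int, int]:
        return dict(self._versions[version])


class VersionNode:
    """Stands in for the watched node that holds the current version number."""

    def __init__(self) -> None:
        self.version: Optional[int] = None
        self._watchers: List[Callable[[int], None]] = []

    def watch(self, callback: Callable[[int], None]) -> None:
        self._watchers.append(callback)

    def set(self, version: int) -> None:
        self.version = version
        for callback in self._watchers:  # push the new version to every watcher
            callback(version)


class RoutingMachine:
    """A machine that reloads its in-memory routing table when notified."""

    def __init__(self, store: VersionedRouteStore, node: VersionNode) -> None:
        self.store = store
        self.table: Dict[int, int] = {}
        self.loaded_version: Optional[int] = None
        node.watch(self.on_version)

    def on_version(self, version: int) -> None:
        self.table = self.store.read(version)
        self.loaded_version = version
```

Because old versions stay in the store, rolling back is just another `set` with the previous version number, which is why the text insists on multi-version writes.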

4.2.2.2.2  Specific consistency scheme

ZooKeeper is a high-performance distributed coordination tool used for communication between nodes and often used for distributed configuration management. When building routing table consistency, most vendors use the same solution.

In distributed coordination scenarios, ephemeral nodes are often used. An ephemeral node is bound to the session that created it: when the session disappears, the node disappears too. This mechanism is often used for heartbeat checks. In building the routing nodes, every node that needs to listen for routing table changes creates an ephemeral node, which serves as the heartbeat check for the node that loads the routing table. A single change proceeds as follows:

Every node creates a watcher on the currentVersion node in ZK to receive pushes of the latest version.

Every node creates an ephemeral node named after its machine under the SessionList directory, indicating that it is listening for changes. When the session disappears, the node is no longer listening.

When a node is pushed a new version, it uses the version number to query the data in the distributed database (the data was prepared beforehand, see 4.2.2.2.1).

After fetching the data and initializing the routing table in local memory, the node writes its machine name as a node under the currentVersion subdirectory of the AckList directory, indicating that it has been updated to the current version.

The change program checks whether the machine nodes under the currentVersion subdirectory of AckList cover all machine nodes under the SessionList directory. If so, all nodes have been updated to the latest version.

Because we know whether all nodes have been updated, and there is a compatible "write-prohibited" state, all nodes can first be updated to the "write-prohibited" state and then switched to the new version of the routing information. Every state that appears is therefore compatible, which guarantees data consistency.
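
The coverage check performed by the change program reduces to a set comparison. In this sketch the two sets stand for the child node names read from SessionList and from the currentVersion subdirectory of AckList; the function names are hypothetical and no ZooKeeper client is involved.

```python
from typing import Set


def all_nodes_updated(session_list: Set[str], ack_list: Set[str]) -> bool:
    """True when every machine that is listening has acknowledged the version."""
    return session_list <= ack_list


def pending_nodes(session_list: Set[str], ack_list: Set[str]) -> Set[str]:
    """Machines that are listening but have not acknowledged yet."""
    return session_list - ack_list
```

A machine whose session has expired drops out of SessionList, so it no longer blocks the change even if it never acknowledged.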

The ZK directory node structure used in the steps above is as follows:

4.2.2.3  The overall architecture

The key technical details have been covered above; next comes the overall architecture. As mentioned earlier, a management and control system is responsible for regional management of all data centers, including the routing table change process. Each data center has a controlled Agent, and the control system manages all data centers by calling the Agents. During a route change, each Agent writes the routing data for its data center into the corresponding distributed database, and ZK then pushes the change, as described above.

4.2.2.4  Change process

When the routing table changes, the complete process is as follows:

(1) Save the current version number V1; it is used for rollback.

(2) Get the current list of data centers and call the Agent of each data center, executing the following steps in order.

(3) The Agent of each data center uses the solution described above to write the data into the distributed database.

(4) If this fails, go directly to step (8).

(5) Get the current list of data centers and call the Agent of each one in a loop to change the user status.

(6) Then use the specific consistency scheme (4.2.2.2.2) to change the status of all users to the final status.

(7) If this fails, go directly to step (8). If it succeeds, the process ends.

(8) Call the Agent of each data center to roll back the version. If this fails, intervene manually.
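
The control flow of this process, including the jump to the rollback step, can be sketched as below. The Agent interface is hypothetical: `write_data`, `prohibit_writes`, `finalize`, and `rollback` are invented method names, `FakeAgent` exists only to make the sketch runnable, and mapping "change the user status" to `prohibit_writes` is an interpretation of the steps rather than something the article states.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class FakeAgent:
    """Hypothetical stand-in for the Agent in one data center."""
    fail_on: Tuple[str, ...] = ()
    calls: List[str] = field(default_factory=list)

    def _run(self, name: str) -> bool:
        self.calls.append(name)
        return name not in self.fail_on

    def write_data(self, version: int) -> bool:
        return self._run("write_data")

    def prohibit_writes(self, version: int) -> bool:
        return self._run("prohibit_writes")

    def finalize(self, version: int) -> bool:
        return self._run("finalize")

    def rollback(self, version: int) -> bool:
        return self._run("rollback")


def change_routes(agents: Sequence[FakeAgent], current_version: int,
                  new_version: int) -> str:
    saved_version = current_version  # remember V1 for rollback
    phases = ("write_data", "prohibit_writes", "finalize")
    for phase in phases:
        # Call every data center's Agent in turn; stop at the first failure.
        if not all(getattr(agent, phase)(new_version) for agent in agents):
            # Roll every data center back to the saved version.
            results = [agent.rollback(saved_version) for agent in agents]
            return "rolled_back" if all(results) else "manual_intervention"
    return "done"
```

The rollback results are collected in a list on purpose, so that every Agent is asked to roll back even if an earlier one reports failure.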

4.2.3  User routing update scheme

The update mechanism of the routing table was introduced above. But how do we determine the data center a user belongs to? How do we change a user's data center? How are the site's existing users added to the routing table? And how are new users added?

4.2.3.1  Determining the user's home data center

In real application scenarios, most users are assigned on a performance-first basis, which essentially means the data center with the lowest access latency. In most scenarios, of course, the data center with the lowest latency is also the physically closest one.

How do we determine a user's data center? The scheme is as follows:

Each user asynchronously accesses all data centers, to measure the latency between the user and each data center.

Statistics are collected at the granularity of the user's region to find the most stable data center.

In the routing table, every user in the region is associated with the data center that performs best overall for that region.
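
A sketch of the regional aggregation, assuming latency samples arrive as `(region, data_center, latency_ms)` tuples and that the lowest median latency is what "best overall" means. The article does not name the metric, so the median is an assumption, as is the function name `best_dc_per_region`.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def best_dc_per_region(samples: Iterable[Tuple[str, str, float]]) -> Dict[str, str]:
    """Pick, for each region, the data center with the lowest median latency."""
    grouped: Dict[str, Dict[str, List[float]]] = defaultdict(lambda: defaultdict(list))
    for region, dc, latency_ms in samples:
        grouped[region][dc].append(latency_ms)
    return {
        region: min(dcs, key=lambda dc: median(dcs[dc]))
        for region, dcs in grouped.items()
    }
```

Using a regional statistic rather than each user's own measurements keeps one slow probe from moving an individual user to a distant data center.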

4.2.3.2  Changing the user's data center

Once the user's home data center is determined, and if the new data center differs from the original one, the user's home has to be changed. As mentioned earlier, during a route change the table must first be rewritten to the forward- and backward-compatible "write-prohibited" state, which ensures that changing the routing table does not itself cause data inconsistency. But before the user moves from "write-prohibited" to "writable", the user's data must also be copied from the original data center to the target data center, and the copy must be confirmed complete. Data replication technology is not discussed here; it will be covered in later chapters.

4.2.3.3 Change optimization: time-sliced changes

Write prohibition may affect users, so we optimize the time of the change to reduce the chance of affecting them. The main approach is to find the period in which the user is most likely to be idle.

（1） Divide time into one-hour periods and give each period an ID.

（2） Assign weights to the user's different behaviors; a weight represents how much write prohibition affects the user during that behavior.

（3） Take user abc's operation records over a period of time and compute the conflict value of each period as follows.

P(0)=1/(1+2+1)*0.2+4/(4+2+2)*0.8=0.45, so the conflict value of the period with ID 0 is 0.45.

P(1)=2/(1+2+1)*0.2+2/(4+2+2)*0.8=0.3, so the conflict value of the period with ID 1 is 0.3.

P(2)=1/(1+2+1)*0.2+2/(4+2+2)*0.8=0.25, so the conflict value of the period with ID 2 is 0.25.

The larger the value, the more this period should be avoided.
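
The calculation can be reproduced with a short function. The per-period counts are read off the formulas (1, 2, 1 operations for the behavior weighted 0.2 and 4, 2, 2 for the behavior weighted 0.8); the behavior names `browse` and `order` are invented for illustration.

```python
from typing import Dict, List


def conflict_values(counts: Dict[str, List[int]], weights: Dict[str, float]) -> List[float]:
    """Weighted share of each behavior's operations that falls in each period."""
    periods = len(next(iter(counts.values())))
    values = [0.0] * periods
    for behavior, per_period in counts.items():
        total = sum(per_period)
        for period_id, count in enumerate(per_period):
            values[period_id] += count / total * weights[behavior]
    return values


# Hypothetical behavior names; the counts and weights come from the formulas.
counts = {"browse": [1, 2, 1], "order": [4, 2, 2]}
weights = {"browse": 0.2, "order": 0.8}

values = conflict_values(counts, weights)
# Rounded to two decimals this gives 0.45, 0.3 and 0.25 for periods 0, 1 and 2.
best_period = min(range(len(values)), key=values.__getitem__)
```

The three values sum to 1 because the weights sum to 1, and the period with the smallest value (ID 2 here) is the best candidate for the change.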

4.2.3.4  Stock update scheme

The stock update scheme applies to two scenarios:

The scheme has just gone live.

A machine has just started.

Both scenarios mean recalculating the home data center of every user currently in the system, using the scheme for determining the user's home data center described earlier (4.2.3.1).

One optimization worth noting is the default home. We pick one data center as the default data center, and users who belong to it do not need to be added to the routing table. When the routing service is asked for such a user's route, the routing table returns a null value and the routing service directly returns the default data center, which greatly reduces the size of the routing table.
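
The default-home fallback amounts to one extra branch in the lookup. In this sketch a plain dict stands in for the routing table, and `DEFAULT_DC` and `home_dc` are hypothetical names.

```python
from typing import Dict

DEFAULT_DC = "dc-a"  # hypothetical default data center


def home_dc(user_id: int, routing_table: Dict[int, str]) -> str:
    """Return the user's home, falling back to the default data center."""
    dc = routing_table.get(user_id)  # None plays the role of the null value
    return DEFAULT_DC if dc is None else dc
```

Only users homed outside the default data center occupy table entries, so the saving grows with the share of users the default data center serves.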

4.2.3.5  Incremental update scheme

The incremental update scheme applies to two scenarios:

User registration

User migration

In the first case, every newly registered user belongs to the default data center and no routing table is changed. The rest of the process is the same as in the second case.

In the second case, multi-data-center probing may find that a user should not belong to the current data center, or that a newly registered user does not get the fastest access from the default data center. The user then needs to be migrated, which is an incremental update. Once the home is confirmed, the incremental update follows the same procedure as the stock update. The difference is that an incremental update changes only a few users, while the stock update does not need to run often.

Five. Summary

This article introduced the basic concepts and domain modeling involved in the remote multi-active / globalization transformation, as well as the storage optimization of the routing system. More on remote multi-active deployment and globalization will follow; you are welcome to follow the 「Dewu Technology」 official account.

Note:

【1】 Next-day delivery (Leadtime, LT) is a fulfillment-commitment product launched by Dewu. Its core logic is to match the shipping park, the receiving city, and the product attributes against the lines configured in the back end, and on that basis promise the user whether the goods can be delivered the next day.

Author｜FUGUOFENG

Focus on technology, and be the trendiest technologist!

