
Practical IM Architecture Techniques for 100 Million Users (Part 2): Reliability, Ordering, Weak-Network Optimization, and More

2022-06-24 17:20:00 JackJiang

This article is based on the outlines of two articles by Deng Yunze, "Massive Concurrency IM Service Architecture Design" and "IM Weak Network Scenario Optimization". Thanks to Deng Yunze for generously sharing them.

1、 Introduction

Following the previous part, 《Practical IM Architecture Techniques for 100 Million Users (Part 1): Overall Architecture, Service Splitting, and More》, this article focuses on some of the more detailed but important hot topics in IM architecture, such as message reliability, message ordering, data security, and weak-network problems on mobile devices.

Each of these hot topics could fill an article of its own. Given the limited length, this article does not discuss each one in detail. It mainly guides readers to the key of each problem and provides links to dedicated articles for selective, in-depth study. I hope it benefits your IM development.

2、 Series articles

For better presentation, this article is split into two parts.

This is the 2nd of the 2 parts:

  • 《Practical IM Architecture Techniques for 100 Million Users (Part 1): Overall Architecture, Service Splitting, and More》
  • 《Practical IM Architecture Techniques for 100 Million Users (Part 2): Reliability, Ordering, Weak-Network Optimization, and More》 (this article)

This part focuses on some of the more detailed but important hot topics in IM architecture.

3、 Message reliability issues

Message reliability is a typical technical metric of an IM system. For users, whether messages are delivered reliably (that is, not lost) is the precondition for trusting the IM at all.

In other words, if the IM system cannot guarantee that messages are not lost, then every message sent has some probability of being lost. Users will inevitably be unable to use it "with peace of mind", which means they "distrust" the IM.

From the product manager's perspective, with such a technical barrier, end users will soon churn no matter how hard the product is promoted. So if an IM cannot guarantee message reliability, the problem is very serious.

PS: If you do not yet have an intuitive picture of message reliability in IM, the article 《Zero-Basics Introduction to IM Development (3): What Is the Reliability of an IM System?》 makes it easy to understand.

As shown in the figure above, message reliability is mainly guaranteed by 2 pieces of logic:

  • 1) Uplink message reliability ;
  • 2) Downlink message reliability .

1) Uplink message reliability can be handled like this:

When the user sends a message (suppose the protocol is called PIMSendReq), the client sets a local ID, then waits for the server to finish processing and return a PIMSendAck (carrying the same local ID) to the sender, which tells the user that the send succeeded.

If this ACK has not arrived after waiting for a while, the send is considered failed and the client SDK retries.
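The uplink flow above can be sketched as a small client-side tracker. This is only an illustration under assumptions: the class name `PendingSends`, the `transport` callable, the 5-second ACK timeout, and the retry limit are all hypothetical. Only PIMSendReq, PIMSendAck, and the local ID come from the text.

```python
import itertools


class PendingSends:
    """Tracks PIMSendReq messages that are still waiting for a PIMSendAck."""

    def __init__(self, transport, ack_timeout_s=5.0, max_retries=3):
        self._transport = transport          # hypothetical callable(local_id, payload)
        self._ack_timeout_s = ack_timeout_s  # assumed value, not from the article
        self._max_retries = max_retries      # assumed value, not from the article
        self._ids = itertools.count(1)       # source of local IDs
        self._pending = {}                   # local_id -> [payload, deadline, retries]

    def send(self, payload, now):
        """Send a message and remember it until the matching ACK arrives."""
        local_id = next(self._ids)
        self._pending[local_id] = [payload, now + self._ack_timeout_s, 0]
        self._transport(local_id, payload)
        return local_id

    def on_ack(self, local_id):
        """Handle a PIMSendAck; returns False for an unknown or duplicate ACK."""
        return self._pending.pop(local_id, None) is not None

    def tick(self, now):
        """Resend overdue messages; return local IDs that ran out of retries."""
        failed = []
        for local_id, entry in list(self._pending.items()):
            payload, deadline, retries = entry
            if now < deadline:
                continue
            if retries >= self._max_retries:
                del self._pending[local_id]
                failed.append(local_id)
            else:
                entry[1] = now + self._ack_timeout_s
                entry[2] = retries + 1
                self._transport(local_id, payload)
        return failed
```

Because a retry can reach the server after the original request did, the server side would typically also need to deduplicate by local ID. That part is not shown here.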

2) Downlink message reliability can be handled like this:

The server receives a message from user A and pushes it to 3 people: B, C, and D. Suppose B is temporarily offline, so the online push is likely to fail.

So the core of downlink reliability is: cache the push request before pushing.

This cache is guaranteed by the storage system. MsgWriter maintains an offline message list: each message from a user is written to the offline message lists of B, C, and D at the same time. After B, C, and D receive the message, each sends an ACK to the storage system, which then removes the message ID from that user's offline message list.
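The "write first, push second, remove on ACK" idea above can be sketched with an in-memory stand-in for the storage system. The names `OfflineStore` and `deliver` and the `push` callable are hypothetical. A real MsgWriter would persist the lists rather than keep them in a dict.

```python
from collections import defaultdict


class OfflineStore:
    """In-memory stand-in for the per-user offline message lists."""

    def __init__(self):
        self._lists = defaultdict(dict)  # user -> {msg_id: message}

    def write(self, msg_id, message, recipients):
        """Record the message for every recipient before any push happens."""
        for user in recipients:
            self._lists[user][msg_id] = message

    def ack(self, user, msg_id):
        """Remove the message once the recipient confirms receipt."""
        return self._lists[user].pop(msg_id, None) is not None

    def pull(self, user):
        """Messages the user has not acknowledged yet, in write order."""
        return list(self._lists[user].items())


def deliver(store, push, msg_id, message, recipients):
    """Cache the message first, then attempt the online push."""
    store.write(msg_id, message, recipients)
    for user in recipients:
        push(user, msg_id, message)  # may fail for offline users; the cache covers it
```

With this shape, a failed push to B costs nothing: the message stays in B's list until B comes back, pulls it, and acknowledges it.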

Message reliability can also be approached from another dimension: the reliability of real-time messages versus offline messages.

If you are interested, these two articles go into depth:

  • 《Implementing IM Message Delivery Guarantees (1): Ensuring Reliable Delivery of Online Real-Time Messages》
  • 《Implementing IM Message Delivery Guarantees (2): Ensuring Reliable Delivery of Offline Messages》

For offline message reliability, one-on-one chat and group chat differ considerably. For reliable delivery of offline messages in group chat, see 《IM Development Insights: How to Gracefully Achieve Reliable Delivery of Large Volumes of Offline Messages》.

4、 The ordering of messages

Message ordering is another tough nut to crack in distributed IM systems.

Because the system is distributed, the clocks of the client and the server may be out of sync. If you simply rely on one party's clock, many messages will appear out of order.

For example, suppose we rely only on the client's clock, and A's clock reads 30 minutes later than B's. A sends B a message, then B replies to A.

The order of sending is:

client A: "XXX"
client B: "YYY"

The order seen by the recipient becomes:

client B: "YYY"
client A: "XXX"

Because A's clock is 30 minutes later, all of A's messages are sorted to the end.

Relying only on the server's clock has a similar problem, because the clocks of 2 servers may also disagree. Even if client A and client B have the same clock, A's message may be handled by server S1 and B's message by server S2, which can equally cause messages to be out of order.

To solve this problem, my approach is the following series of measures.

1) Server time alignment:

This part is the responsibility of back-end operations. The system administrators have to keep the servers aligned as well as they can, and there is no other way.

2) The client aligns with server time through time calibration:

For example: after login, the client computes the difference between client time and server time, and takes this difference into account when sending a message.

In my IM architecture, this aligns time to within roughly 100 ms. Going any finer is very difficult, because the RTT of the protocol exchange between client and server is itself unstable (network transmission carries an uncontrollable risk of delay).
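One common way to compute such a difference is to assume the request and the response take equally long and split the RTT in half. The article does not say how its difference is computed, so the formula and function names below are an assumption, not the author's method.

```python
def estimate_offset_ms(t_send_ms, server_ms, t_recv_ms):
    """Estimate (server clock - client clock) from one request/response pair.

    t_send_ms / t_recv_ms are client timestamps taken when the request left
    and when the response arrived; server_ms is the server time in the reply.
    Assumes the network delay is the same in both directions.
    """
    rtt = t_recv_ms - t_send_ms
    offset = server_ms - (t_send_ms + rtt / 2)
    return offset, rtt


def best_offset_ms(samples):
    """Pick the offset from the sample with the smallest RTT.

    A shorter round trip leaves less room for asymmetric delay, so that
    sample is usually the least noisy one.
    """
    return min((estimate_offset_ms(*s) for s in samples), key=lambda r: r[1])[0]
```

The unstable RTT mentioned above is exactly what limits the precision: the error of a single sample can be as large as half its RTT.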

3) Each message carries both local time and server time:

Sorting can then work like this: messages from the same person are ordered by the message's local time, and messages from different people are ordered by server time. This is an interpolation-style sorting algorithm.
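The rule above does not define a single sort key, so here is one possible reading as a sketch: lay out the "slots" by server time, then fill each sender's slots with that sender's messages in local-time order. The `Msg` type and `order_messages` are hypothetical names, and this interpretation is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Msg:
    sender: str
    local_ts: int   # sender's (calibrated) local time
    server_ts: int  # time stamped by the server
    text: str


def order_messages(msgs):
    """Order across senders by server time, within a sender by local time."""
    by_server = sorted(msgs, key=lambda m: m.server_ts)

    # Group each sender's messages, then re-sort them by local time.
    per_sender = defaultdict(list)
    for m in by_server:
        per_sender[m.sender].append(m)
    queues = {
        s: iter(sorted(ms, key=lambda m: m.local_ts))
        for s, ms in per_sender.items()
    }

    # Keep the sender pattern given by server time, but take each sender's
    # messages in the order they were written locally.
    return [next(queues[m.sender]) for m in by_server]
```

For example, if A writes A1 and then A2, but A1 reaches the server late because of a retry, A1 is still shown before A2, while B's message keeps its position relative to A according to server time.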

PS: Message ordering obviously cannot be fully explained in the few sentences above. For a more accessible explanation, read 《Zero-Basics Introduction to IM Development (4): What Is Message Timing Consistency in an IM System?》.

In addition, from the standpoint of practical feasibility, the ideas in 《A Discussion of a Low-Cost Method for Ensuring IM Message Timing》 and 《How to Ensure the "Timing" and "Consistency" of Real-Time IM Messages?》 are worth borrowing.

In fact, message ordering can also be handled through message IDs: make the message IDs themselves ordered, so that messages can be sorted by ID.
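As a rough illustration of what an "ordered ID" can look like, the sketch below packs a millisecond timestamp and a per-millisecond sequence number into one integer. This is a generic timestamp-plus-sequence scheme chosen for illustration. It is not the WeChat or Rongyun algorithm, and the class name and bit width are assumptions.

```python
import threading
import time


class OrderedIdGenerator:
    """Generates strictly increasing integer IDs on a single node."""

    SEQ_BITS = 12  # assumed width: up to 4096 IDs per millisecond

    def __init__(self, clock=lambda: int(time.time() * 1000)):
        self._clock = clock
        self._lock = threading.Lock()
        self._last_ms = -1
        self._seq = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                now = self._last_ms  # clock moved backwards: never go back
            if now == self._last_ms:
                self._seq += 1
                if self._seq >= (1 << self.SEQ_BITS):
                    now += 1         # sequence exhausted: borrow the next millisecond
                    self._seq = 0
            else:
                self._seq = 0
            self._last_ms = now
            return (now << self.SEQ_BITS) | self._seq
```

Sorting by such an ID gives the generation order on that node. Ordering across several generator nodes needs extra design that this sketch does not cover.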

On algorithms for ordered message IDs, these two articles are worth studying: 《IM Message ID Technical Topics (1): WeChat's Practice of Generating Sequence Numbers for Massive IM Chat Messages (Algorithm Principles)》 and 《IM Message ID Technical Topics (3): Decrypting the Chat Message ID Generation Strategy of Rongyun's IM Product》. I will not go into more detail here.

5、 Message read synchronization problem

The read/unread feature for messages is shown in the figure below:

The picture above shows read and unread messages in DingTalk. This feature is very useful in enterprise IM scenarios (because bosses like it, as you can imagine).

For one-on-one chat, the read/unread feature is easy to understand: just add a corresponding receipt, sent back when the user reads the message.

For group chat, however, showing how many people have read a message and how many have not is somewhat more troublesome. The implementation logic for group chat is not covered here. If you are interested, read 《How to Implement Read Receipts for IM Group Chat Messages?》.

Back to this section's topic, "read sync". It is one level harder, because read/unread receipts no longer apply only to an "account". They have to be broken down to "the same account logged in on different devices", so the synchronization logic for read receipts becomes a bit complicated.

Here I offer some ideas based on practical experience with my own IM architecture.

Specifically: a user may have multiple devices logged in to the same account (for example, Web/PC and mobile at the same time). In this case the read/unread feature needs read sync. Otherwise a message read on device 1 still shows as unread on device 2, which hurts the user experience from a product standpoint.

In my IM architecture, read sync is mainly guaranteed by 2 pieces of logic:

  • 1) Sync state maintenance: for each of a user's Sessions, maintain a timestamp that stores the last read time;
  • 2) If the user opens a Session and has multiple devices online, send a PIMSyncRead message to notify the other devices.
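The two pieces of logic above can be sketched as follows. The class `ReadSync`, its method names, and the `notify` callable (standing in for sending PIMSyncRead) are hypothetical. Only the per-Session timestamp and the PIMSyncRead notification come from the text.

```python
class ReadSync:
    """Keeps a last-read timestamp per (user, session) and fans out updates."""

    def __init__(self, notify):
        self._notify = notify  # hypothetical callable(device, session_id, read_ts)
        self._last_read = {}   # (user, session_id) -> last read timestamp
        self._devices = {}     # user -> set of online device IDs

    def device_online(self, user, device):
        self._devices.setdefault(user, set()).add(device)

    def open_session(self, user, device, session_id, now):
        """Called when a device opens a session, i.e. reads everything up to now."""
        key = (user, session_id)
        if now <= self._last_read.get(key, 0):
            return  # stale update: an equal or newer read time is already stored
        self._last_read[key] = now
        # Notify every other online device of the same account (PIMSyncRead).
        for other in sorted(self._devices.get(user, set()) - {device}):
            self._notify(other, session_id, now)

    def is_unread(self, user, session_id, msg_ts):
        """A message is unread if it is newer than the stored read time."""
        return msg_ts > self._last_read.get((user, session_id), 0)
```

Storing a single timestamp per Session keeps the state small: devices never exchange per-message flags, only "everything up to time T has been read".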

6、 Data security issues

6.1 Basics

Data security in an IM system architecture is more complex than in a typical system. On the communication side, it involves both the security of socket long connections and the security of http short connections. And with the popularity of mobile IM, trade-offs must also be made among security, performance, data traffic, and user experience. Building a complete IM security architecture therefore faces many challenges.

In an IM system architecture, data security mainly means communication security and content security.

6.2 Communication security

To discuss communication security, we first need to understand which services make up IM communication.

At present, a typical IM system mainly consists of two kinds of communication services:

  • 1) socket long-connection service: technically, the network communication most people are familiar with, or more precisely the TCP and UDP protocol part;
  • 2) http short-connection service: the commonly used HTTP REST interfaces.

For how to improve long-connection security, read 《Easy to Understand: Master the Principles of Message Transmission Security in Instant Messaging》. In addition, 《WeChat's New-Generation Communication Security Solution: A Detailed Explanation of MMTLS Based on TLS 1.3》, shared by the WeChat team, is also a valuable reference.

If a higher level of communication security is required, refer to 《Instant Messaging Security (2): On the Application of Combined Encryption Algorithms in IM》. Its idea of combining encryption algorithms is very good.

As for short-connection security, everyone is familiar with it: enabling https is enough most of the time. If you do not know much about https, start with these articles: 《One Article to Understand Https Security Principles, Digital Certificates, One-Way Authentication, Two-Way Authentication, and More》 and 《Instant Messaging Security (7): If You Understand HTTPS This Way, One Article Is Enough》.

6.3 Content security

This may not be easy to understand: now that communication security is in place, why still worry about "content security"?

Let's look at the three functions of cryptography: encryption (Encryption), authentication (Authentication), and identification (Identification).

In detail:

  • Encryption: prevents bad actors from obtaining your data.
  • Authentication: prevents bad actors from modifying your data without you noticing.
  • Identification: prevents bad actors from impersonating you.

With the measures in the previous section, if a malicious attacker bypasses or breaks the "identification" and "authentication" of the communication link, then the "encryption" that depends on them can in fact also be cracked.

To address this, the content itself needs to be encrypted more securely and independently. This is what is called "end-to-end encryption" (E2E).

For example, Telegram, the IM that claims to be unbreakable, in fact uses end-to-end encryption.

I will not go into end-to-end encryption here. If you are interested, these two articles go into depth:

  • 《A Powerful Tool for Secure Mobile Communication: End-to-End Encryption (E2EE) Explained in Detail》
  • 《A Brief Description of How End-to-End Encryption (E2EE) Works in Real-Time Audio and Video Chat》

7、 The avalanche problem

A distributed IM architecture has an avalanche problem.

As we know, in a distributed IM architecture, for high availability, each user login is assigned to a server by a load-balancing algorithm. This is where the problem arises.

For example: suppose there are 5 data centers. Data center A fails, and all the users it was serving move to data center B. B is overwhelmed and collapses too. The combined users of A and B (more than 100% of the load) then move to data center C, and the chain reaction brings down all services.

Preventing the avalanche effect requires matching solutions in both the server architecture and the client connection strategy. The server needs rate-limiting capability as its foundation, mainly limiting the total number of users it serves and the number of users connecting within a short period.
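The two server-side limits just described (total users, and new connections within a short period) can be sketched as a small admission check. `AdmissionLimiter` and its parameters are hypothetical names, and a sliding time window is just one assumed way to define "short-term".

```python
from collections import deque


class AdmissionLimiter:
    """Rejects new connections when the server is full or the connect rate is too high."""

    def __init__(self, max_total, max_new_per_window, window_s=1.0):
        self.max_total = max_total       # cap on users served at once
        self.max_new = max_new_per_window  # cap on new connections per window
        self.window_s = window_s
        self.total = 0
        self._recent = deque()           # admission times inside the window

    def try_admit(self, now):
        # Drop admissions that have slid out of the time window.
        while self._recent and now - self._recent[0] >= self.window_s:
            self._recent.popleft()
        if self.total >= self.max_total or len(self._recent) >= self.max_new:
            return False
        self.total += 1
        self._recent.append(now)
        return True

    def release(self):
        """Call when a user disconnects."""
        if self.total > 0:
            self.total -= 1
```

A rejected client is expected to back off and retry elsewhere, which is where the client-side strategy comes in.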

On the client side, there should be a strategy for what to do after a disconnect is detected, to prevent a large number of users from connecting to one server at the same time.

There are usually 2 schemes:

  • 1) Backoff: set a random interval between reconnection attempts;
  • 2) LBS: the client requests a new server IP for reconnection, and the LBS service reduces the number of users assigned to the same server within a short time.

These 2 schemes do not conflict and can be used together.
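The backoff scheme can be sketched as a function that returns a random wait before the next reconnection attempt. The text only asks for a random interval. Growing the upper bound exponentially with each attempt and capping it are additional assumptions, as are the function name and default values.

```python
import random


def reconnect_delay_s(attempt, base_s=1.0, cap_s=60.0, rng=random):
    """Random delay before reconnect attempt number `attempt` (0-based).

    The delay is drawn uniformly from [0, ceiling], where the ceiling doubles
    with each attempt up to cap_s, so clients that dropped at the same moment
    spread their reconnects out instead of arriving together.
    """
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

A client would sleep for `reconnect_delay_s(n)` before its n-th retry, and could ask the LBS for a fresh server address on each attempt, which combines both schemes.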

8、 Weak network problem

8.1 The reason for the weak network problem

Now that mobile IM is ubiquitous, weak networks are a routine problem. In elevators, on trains, while driving, on the subway, and so on, users run into obvious weak-network problems.

So why do weak-network problems occur?

To answer that, we need to look at the principles of wireless communication.

Wireless communication quality is subject to many factors, such as rapidly changing signal strength, signal interference, uneven distribution of base stations, and moving too fast. Explaining it all would take three days and three nights.

Interested readers should read the following articles carefully. Cross-disciplinary articles like these are rare:

  • 《Zero-Basics Communication Technology for IM Developers (11): Why Is the WiFi Signal Poor? One Article Explains It!》
  • 《Zero-Basics Communication Technology for IM Developers (12): Network Lagging? Connection Dropped? One Article Explains It!》
  • 《Zero-Basics Communication Technology for IM Developers (13): Why Is the Mobile Phone Signal Poor? One Article Explains It!》
  • 《Zero-Basics Communication Technology for IM Developers (14): How Hard Is Wireless Internet Access on High-Speed Rail? One Article Explains It!》

Weak networks are a required course for mobile apps. The following summaries are also worth studying:

  • 《Must-Read for Mobile IM Developers (1): Understanding the "Weak" and "Slow" of Mobile Networks in Plain Language》
  • 《Must-Read for Mobile IM Developers (2): The Most Complete Summary of Mobile Weak-Network Optimization Methods》
  • 《Summary of Short-Connection Optimization Methods for Modern Mobile Networks: Request Speed, Weak-Network Adaptation, Security》
  • 《Baidu APP Mobile Network Deep Optimization Practice (3): Mobile Weak-Network Optimization》

8.2 IM Dealing with weak network problems

For IM, the weak-network problem is not very complicated. The core is message resending, ordering, and retrying at the receiving end.

Weak-network problems in IM can usually be mitigated by the following means:

  • 1) Message auto resend ;
  • 2) Offline message reception ;
  • 3) Resend message sort ;
  • 4) Offline instruction processing .

We will discuss them one by one .

8.3 Message auto resend

On the subway, it often happens that the network drops right after the train starts moving and the message fails to send.

At this point the product can behave in 2 ways:

  • a、 Tell the user that the send failed;
  • b、 Keep sending: retry automatically 3-5 times (3 minutes), and only then tell the user that the send failed.

Clearly, telling the user only after the automatic retries have failed gives a much better experience. Especially when the network drops only briefly, the retry success rate is very high, and the user may never notice that a send failed at all.

Technically: the client IMSDK tracks the status of each message. It cannot simply call the network layer to send a message. It needs a state machine that manages several states: initial, sending, send failed, and send timed out. The retry mechanism kicks in for the failed and timed-out states.
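The state machine just described might look like the sketch below. The four states INIT, SENDING, FAILED, and TIMEOUT come from the text. The `SENT` and `GAVE_UP` states, the class names, and the retry limit are assumptions added so the example is complete.

```python
from enum import Enum


class SendState(Enum):
    INIT = "init"
    SENDING = "sending"
    FAILED = "failed"
    TIMEOUT = "timeout"
    SENT = "sent"        # assumed terminal state: ACK received
    GAVE_UP = "gave_up"  # assumed terminal state: show the failure to the user


class OutgoingMessage:
    """Per-message send state with a bounded automatic retry."""

    def __init__(self, max_retries=3):
        self.state = SendState.INIT
        self.retries = 0
        self.max_retries = max_retries  # assumed limit, in the 3-5 range from the text

    def start(self):
        if self.state is not SendState.INIT:
            raise ValueError("message was already started")
        self.state = SendState.SENDING

    def on_ack(self):
        if self.state is SendState.SENDING:
            self.state = SendState.SENT

    def on_error(self, timed_out=False):
        if self.state is SendState.SENDING:
            self.state = SendState.TIMEOUT if timed_out else SendState.FAILED

    def retry(self):
        """Start a retry if allowed; return False when the user must be told."""
        if self.state not in (SendState.FAILED, SendState.TIMEOUT):
            return False
        if self.retries >= self.max_retries:
            self.state = SendState.GAVE_UP
            return False
        self.retries += 1
        self.state = SendState.SENDING
        return True
```

Only the `GAVE_UP` state would be surfaced in the UI as "send failed". Everything before that stays invisible to the user, which is the point of option b above.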

There is also a discussion thread on the design of retry mechanisms that you can look at if interested: 《How Should a Fully Self-Developed IM Design Its "Retry on Failure" Mechanism?》.

The implementation of message timeout and retransmission in 《Implementing IM Message Delivery Guarantees (1): Ensuring Reliable Delivery of Online Real-Time Messages》 is also a useful reference.

8.4 Offline message reception

Modern IM does not have an "online" status as such, and there is no need to show this information to users. On the technical level, however, it is still necessary to detect correctly when a user has dropped offline.

There are several ways to detect this:

  • a、 Long-connection signaling status: if no heartbeat response is received from the server for a long time, the connection has dropped;
  • b、 Number of failed network requests: if several network requests fail, the client has "probably" dropped offline;
  • c、 Device network state detection: check the state of the network interface directly. Android/iOS/Windows/Mac generally have corresponding system APIs.

Once the network state is detected correctly and a switch from "disconnected" to "restored" is observed, the client should actively pull the messages from the offline period so that no messages are lost on a weak network (they are pulled from the server's offline message list).
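Signals a and b above, plus the "pull on restore" step, can be combined in a small monitor. The sketch is deliberately simple: `ConnectivityMonitor`, the 30-second heartbeat timeout, and the failure threshold are assumptions. Signal c (the OS network state API) is left out because it is platform-specific.

```python
class ConnectivityMonitor:
    """Tracks online/offline state and fires a callback when the network comes back."""

    def __init__(self, on_restore, heartbeat_timeout_s=30.0, max_failures=3):
        self._on_restore = on_restore  # e.g. pull the offline message list
        self._heartbeat_timeout_s = heartbeat_timeout_s  # assumed value
        self._max_failures = max_failures                # assumed value
        self._last_heartbeat = 0.0
        self._failures = 0
        self.online = True  # assumed starting state

    def on_heartbeat(self, now):
        """Server answered a heartbeat on the long connection."""
        self._last_heartbeat = now
        self._failures = 0
        self._set_online(True)

    def on_request_result(self, ok):
        """Outcome of an ordinary network request."""
        if ok:
            self._failures = 0
            self._set_online(True)
        else:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._set_online(False)

    def tick(self, now):
        """Periodic check: no heartbeat for too long means we are offline."""
        if now - self._last_heartbeat > self._heartbeat_timeout_s:
            self._set_online(False)

    def _set_online(self, online):
        was_online = self.online
        self.online = online
        if online and not was_online:
            self._on_restore()  # disconnected -> restored: fetch what was missed
```

The callback fires only on the transition from offline to online, so a steady stream of successful heartbeats does not trigger repeated pulls.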

The network-state detection mentioned above involves the connection checking and keep-alive mechanisms in IM networking, which are a real headache in IM.

Without meaning to, I have stepped into the pit of IM network keep-alive again. I will not expand on it here. If you are interested, be sure to read the following articles:

  • 《Why Does TCP-Based Mobile IM Still Need a Heartbeat Keep-Alive Mechanism?》
  • 《One Article to Understand the Network Heartbeat Packet Mechanism in Instant Messaging Apps: Purpose, Principles, Implementation Ideas, and More》
  • 《WeChat Team Original: Keeping WeChat for Android Alive in the Background in Practice (Network Keep-Alive)》
  • 《Mobile IM Practice: Implementing the Intelligent Heartbeat Mechanism of WeChat for Android》
  • 《Mobile IM Practice: Analysis of the Heartbeat Strategies of WhatsApp, Line, and WeChat》

8.5 Resend message sort

Another pitfall of weak-network logic is message ordering.

Suppose there are 3 messages, A, B, and C. A and C are sent successfully, while B hits a brief network drop during sending and triggers an automatic retry.

Should the receiver then see the order A B C or A C B? The IM products I have looked at handle this differently. If you are interested, try it out yourself.

The solution is to rely on the interpolation-style sort mentioned earlier in the service architecture discussion: messages from the same person are sorted by the local time attached to the message, and messages from different people are sorted by server time.

I will not repeat the details here. See Section 4 of this article, "4、 The ordering of messages".

8.6 Offline instruction processing

Some operations may be performed while the network is down. After the network is restored, they must be synchronized to the server automatically.

For example, put your phone into airplane mode, delete a contact in WeChat, and see whether the deletion works. Then turn the network back on and see whether the change is synchronized to the server.

Similar logic applies to read sync and other scenarios: information produced offline must be synchronized to the server correctly.
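One simple way to get this behavior is to queue operations locally and replay them in order once the network is back. The sketch below uses hypothetical names (`OfflineOpQueue`, and a `send` callable that returns whether the server accepted the operation). Persistence of the queue across app restarts is not shown.

```python
from collections import deque


class OfflineOpQueue:
    """Queues operations made while offline and replays them in order."""

    def __init__(self, send):
        self._send = send        # hypothetical callable(op) -> bool
        self._pending = deque()

    def submit(self, op):
        """Try to send right away; queue the operation if that is not possible."""
        # If older operations are still waiting, queue behind them to keep order.
        if not self._pending and self._send(op):
            return True
        self._pending.append(op)
        return False

    def flush(self):
        """Call when the network is restored; returns how many operations were sent."""
        sent = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break            # still offline: keep the rest for later
            self._pending.popleft()
            sent += 1
        return sent
```

In the airplane-mode example, deleting the contact would be queued by `submit`, and the deletion would reach the server when `flush` runs after the network returns.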

8.7 To sum up

Weak-network handling in IM is actually relatively simple. Automatic retries plus message status tracking basically solve most of the problems.

The details are not complicated, mainly because the volume of IM messages is relatively small, so recovery is fast once the network is restored.

The logic of video conferencing under weak networks is much more complicated, especially keeping audio and video as smooth as possible in a weak-network environment with high packet loss.

9、 This paper summarizes

This concludes the two-part series 《Practical IM Architecture Techniques for 100 Million Users》. The first part, on IM architecture, went smoothly enough. The second part accidentally dug up all kinds of "pits" among IM's hot topics. IM development is hard to sum up in a few words...

For friends just getting started with IM development: if you want to learn mobile IM development systematically, you should read the "from beginner to giving up" article I compiled (ha ha ha), namely 《One Article Is Enough for Beginners: Developing Mobile IM from Scratch》. I will stop here, otherwise this article will not be able to hit the brakes again... (This article is published simultaneously at: http://www.52im.net/thread-3445-1-1.html)

10、 Reference material

[1] Massive Concurrency IM Service Architecture Design

[2] IM Weak Network Scenario Optimization

[3] Zero-Basics Introduction to IM Development (3): What Is the Reliability of an IM System?

[4] Implementing IM Message Delivery Guarantees (1): Ensuring Reliable Delivery of Online Real-Time Messages

[5] IM Development Insights: How to Gracefully Achieve Reliable Delivery of Large Volumes of Offline Messages

[6] Instant Messaging Security (2): On the Application of Combined Encryption Algorithms in IM

[7] WeChat's New-Generation Communication Security Solution: A Detailed Explanation of MMTLS Based on TLS 1.3

Copyright notice
This article was written by [JackJiang]. Please include the original link when reposting. Thank you.
https://yzsam.com/2021/03/20210322183714259j.html