IM Architecture Essentials for 100 Million Users (Part 2): Reliability, Ordering, Weak-Network Optimization, and More
2022-06-24 17:20:00 【JackJiang】
This article follows the outline of two articles by Deng Yunze, "Massive concurrency IM Service Architecture Design" and "IM Weak network scenario optimization". Thanks to Deng Yunze for his selfless sharing.
1、 Introduction
Following the previous part, 《IM Architecture Essentials for 100 Million Users (Part 1): Overall Architecture, Service Splitting, and More》, this article focuses on some of the more detailed but important hot topics in IM architecture, such as message reliability, message ordering, data security, and weak-network problems on mobile.
Each of these hot IM topics could fill an article of its own. Given the length limit, this article does not discuss each one in detail. It mainly guides readers to the key point of each problem and points to dedicated articles for selective in-depth study. I hope it brings some benefit to your IM development.
2、 Series articles
For better presentation, this series is divided into two parts.
This article is part 2 of 2:
- 《IM Architecture Essentials for 100 Million Users (Part 1): Overall Architecture, Service Splitting, and More》
- 《IM Architecture Essentials for 100 Million Users (Part 2): Reliability, Ordering, Weak-Network Optimization, and More》 (this article)
This part focuses on some of the more detailed but important hot topics in IM architecture.
3、 Message reliability issues
Message reliability is a key technical indicator of an IM system. For users, whether messages are delivered reliably (no lost messages) is the precondition for trusting that IM system.
In other words, if an IM system cannot guarantee that messages are not lost, then every message sent has some probability of being lost. Users cannot use it with peace of mind, which means they do not trust it.
From a product manager's perspective, with this kind of technical flaw, users will soon be lost no matter how hard the product is promoted. So if an IM system cannot guarantee message reliability, the problem is very serious.
PS: If you do not yet have an intuitive picture of the message reliability problem, the article 《IM Development for Beginners (3): What Is the Reliability of an IM System?》 makes it easy to understand.
As shown in the figure above, message reliability is mainly guaranteed by two mechanisms:
- 1) Uplink message reliability;
- 2) Downlink message reliability.
1) Uplink message reliability can be handled as follows:
When the user sends a message (assume the protocol message is called PIMSendReq), the client assigns it a local ID and then waits for the server to finish processing and return a PIMSendAck carrying the same local ID, which tells the user that the send succeeded.
If no ACK arrives after waiting for a while, the send is considered failed and the client SDK retries.
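The uplink flow above can be sketched as follows. This is a minimal illustration, not the author's implementation: `UplinkSender`, `transport`, `on_ack`, and `on_timeout` are hypothetical names, the retry limit is an assumed parameter, and timers are left to the caller so the logic stays easy to follow.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PendingSend:
    local_id: int
    payload: str
    attempts: int = 0


class UplinkSender:
    """Keeps each message pending until a PIMSendAck with the same local ID arrives."""

    def __init__(self, transport: Callable[[int, str], None], max_attempts: int = 3):
        self._transport = transport        # stands in for sending a PIMSendReq
        self._max_attempts = max_attempts  # assumed limit, not from the article
        self._ids = itertools.count(1)     # local ID generator
        self.pending: Dict[int, PendingSend] = {}
        self.failed: List[int] = []

    def send(self, payload: str) -> int:
        local_id = next(self._ids)
        self.pending[local_id] = PendingSend(local_id, payload)
        self._transmit(local_id)
        return local_id

    def _transmit(self, local_id: int) -> None:
        msg = self.pending[local_id]
        msg.attempts += 1
        self._transport(local_id, msg.payload)

    def on_ack(self, local_id: int) -> None:
        # PIMSendAck received: the message no longer needs tracking.
        self.pending.pop(local_id, None)

    def on_timeout(self, local_id: int) -> None:
        # Called by the caller's timer when no ACK arrived in time.
        msg = self.pending.get(local_id)
        if msg is None:
            return  # already acknowledged
        if msg.attempts >= self._max_attempts:
            del self.pending[local_id]
            self.failed.append(local_id)  # surface the failure to the user
        else:
            self._transmit(local_id)      # retry with the same local ID
```

Because a retry reuses the same local ID, a server could use that ID to discard duplicates when the original request did arrive and only the ACK was lost. That deduplication step is my assumption and is not described in the text.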
2) Downlink message reliability can be handled as follows:
The service receives a message from user A and pushes it to three people: B, C, and D. Suppose B is temporarily offline; the online push is then likely to fail.
So the core of downlink reliability is: cache the push request before pushing.
This cache is guaranteed by the storage system. MsgWriter maintains an offline message list: each message from a user is written to the offline message lists of B, C, and D at the same time. After B, C, and D receive the message, each sends an ACK to the storage system, which then removes that message ID from the corresponding offline message list.
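The offline message list can be pictured with a small in-memory stand-in. This is only a sketch under assumptions: `OfflineStore` and its methods are hypothetical names, and a real system would back this with durable storage rather than a Python dictionary.

```python
from collections import defaultdict


class OfflineStore:
    """In-memory stand-in for the per-user offline message lists."""

    def __init__(self):
        # user -> {msg_id: payload}; dicts keep insertion order
        self._lists = defaultdict(dict)

    def enqueue(self, msg_id, payload, recipients):
        # Write the message to every recipient's list before pushing it.
        for user in recipients:
            self._lists[user][msg_id] = payload

    def ack(self, user, msg_id):
        # The recipient confirmed receipt: drop the message ID from its list.
        self._lists[user].pop(msg_id, None)

    def pending(self, user):
        # Messages still waiting for this user's ACK, oldest first.
        return list(self._lists[user].items())


store = OfflineStore()
store.enqueue(1, "hello", ["B", "C", "D"])
store.ack("C", 1)  # C was online and acknowledged; B and D still have it queued
```

Here anything left in `pending("B")` is what B would pull after coming back online.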
Message reliability can also be approached from another dimension: the reliability of real-time messages versus offline messages.
If you are interested, these two articles go into depth:
- 《IM Message Delivery Guarantee Mechanisms (1): Ensuring Reliable Delivery of Online Real-Time Messages》
- 《IM Message Delivery Guarantee Mechanisms (2): Ensuring Reliable Delivery of Offline Messages》
For offline message reliability, one-on-one chat and group chat differ considerably. For reliable delivery of offline messages in group chat, see 《IM Development Insights: How to Elegantly Achieve Reliable Delivery of Large Volumes of Offline Messages》.
4、 The ordering of messages
Message ordering is another "tough nut" in distributed IM systems.
Because the system is distributed, the clocks of clients and servers may be out of sync. If you simply rely on one party's clock, many messages will end up out of order.
For example, suppose we rely only on client clocks, and A's clock reads 30 minutes later than B's. A sends B a message, then B replies to A.
The order of sending is:
- Client A: "XXX"
- Client B: "YYY"
The order seen by the recipient becomes:
- Client B: "YYY"
- Client A: "XXX"
Because A's clock reads 30 minutes later, all of A's messages are sorted to the end.
Relying only on server clocks causes similar problems, because the clocks of two servers may also disagree. Even if client A's and client B's clocks agree, when A's message is handled by server S1 and B's message by server S2, the messages can still end up out of order.
To solve this problem, my approach is the following series of steps.
1) Server time alignment:
This part is the responsibility of back-end operations. System administrators have to keep server clocks aligned as closely as possible; there is no other way.
2) The client aligns with server time through time calibration:
For example, after logging in, the client computes the difference between client time and server time, and takes that difference into account when sending messages.
In my IM architecture, this aligns time to roughly the 100 ms level. Getting the difference any smaller is very difficult, because the round-trip time (RTT) of protocol messages between client and server is itself unstable (network transmission carries uncontrollable delay).
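The calibration step can be sketched with an NTP-style estimate. This is a minimal sketch under assumptions the article does not state: the server returns its current time in a response, the network delay is roughly symmetric, and the function names are hypothetical.

```python
def estimate_offset_ms(client_send_ms, server_ms, client_recv_ms):
    """Estimate (server clock - client clock) from one request/response pair.

    Assumes the request and the response each took about half of the RTT.
    Returns (offset_ms, rtt_ms).
    """
    rtt = client_recv_ms - client_send_ms
    offset = server_ms - (client_send_ms + rtt / 2)
    return offset, rtt


def best_offset(samples):
    """Pick the offset from the sample with the smallest RTT.

    A short round trip leaves less room for asymmetric delay, so its
    estimate is usually the most trustworthy one.
    """
    estimates = [estimate_offset_ms(*s) for s in samples]
    return min(estimates, key=lambda e: e[1])[0]


def to_server_time(client_now_ms, offset_ms):
    # Timestamp to attach to an outgoing message, expressed in server time.
    return client_now_ms + offset_ms
```

The error of each estimate is bounded by half the RTT, which is consistent with the observation above that unstable RTT makes it hard to do much better than about 100 ms.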
3) Each message carries both the local time and the server time:
Sorting can then work like this: messages from the same person are ordered by the local time of the message, and messages from different people are ordered by server time. This is an interpolation-style sorting algorithm.
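One way to read this rule is sketched below. It is my interpretation rather than the author's exact algorithm: server time decides which sender occupies each position in the timeline, and each sender's own messages are then placed into that sender's positions in local-time order. The `Msg` type and `order_messages` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Msg:
    sender: str
    local_ms: int    # sender's local clock when the message was created
    server_ms: int   # server clock when the message was received
    text: str


def order_messages(messages):
    # Step 1: server time decides how different senders interleave.
    by_server = sorted(messages, key=lambda m: m.server_ms)

    # Step 2: within one sender, local time decides the order.
    per_sender = {}
    for m in sorted(messages, key=lambda m: m.local_ms):
        per_sender.setdefault(m.sender, []).append(m)
    cursors = {sender: iter(queue) for sender, queue in per_sender.items()}

    # Step 3: walk the server-time slots and fill each slot with the
    # next message (in local-time order) from the sender owning that slot.
    return [next(cursors[m.sender]) for m in by_server]
```

With this scheme a message that reached the server late because of a retry still appears in the position its sender intended relative to that sender's other messages, while the interleaving between different senders never depends on any client clock.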
PS: Message ordering obviously cannot be explained fully in the few sentences above. For a more accessible explanation, read 《IM Development for Beginners (4): What Is Message Ordering Consistency in an IM System?》.
In addition, from the standpoint of practical feasibility, the ideas in 《A Low-Cost Method for Guaranteeing IM Message Ordering》 and 《How to Guarantee the "Ordering" and "Consistency" of IM Real-Time Messages?》 are worth referencing.
In fact, message ordering can also be handled through message IDs: generate IDs that are themselves ordered, and messages can then be sorted by ID.
On algorithms for ordered message IDs, these two articles are worth learning from: 《IM Message ID Technical Topics (1): WeChat's Practice of Generating Sequence Numbers for Massive IM Chat Messages (Algorithm Principles)》 and 《IM Message ID Technical Topics (3): Demystifying the Chat Message ID Generation Strategy of Rongyun's IM Product》. I won't say more here.
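To make "ordered message IDs" concrete, here is a sketch of a Snowflake-style layout (timestamp, node ID, and per-millisecond sequence packed into one integer). This is a widely known scheme that I am swapping in for illustration; it is not claimed to be the algorithm described in either of the two articles above, and the bit widths and class name are assumptions.

```python
class OrderedIdGenerator:
    """Generates IDs that increase over time on a single node."""

    NODE_BITS = 10  # up to 1024 nodes (assumed width)
    SEQ_BITS = 12   # up to 4096 IDs per millisecond per node (assumed width)

    def __init__(self, node_id, clock):
        if not 0 <= node_id < (1 << self.NODE_BITS):
            raise ValueError("node_id out of range")
        self._node = node_id
        self._clock = clock  # callable returning the current time in ms
        self._last_ms = -1
        self._seq = 0

    def next_id(self):
        now = self._clock()
        if now < self._last_ms:
            now = self._last_ms  # clock went backwards: keep IDs monotonic
        if now == self._last_ms:
            self._seq += 1
            if self._seq >= (1 << self.SEQ_BITS):
                now += 1         # sequence exhausted: borrow the next ms
                self._seq = 0
        else:
            self._seq = 0
        self._last_ms = now
        shift = self.NODE_BITS + self.SEQ_BITS
        return (now << shift) | (self._node << self.SEQ_BITS) | self._seq
```

IDs from one generator are strictly increasing. Across nodes they are ordered only as well as the nodes' clocks agree, which ties back to the server time alignment step above.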
5、 Message read synchronization problem
The read/unread feature for messages is shown in the figure below:
The figure above shows read/unread message status in DingTalk. It is very useful in enterprise IM scenarios (because bosses like it, as you can imagine).
For one-on-one chat messages, read/unread is easy to understand: it just adds a corresponding receipt, sent back when the user reads the message.
For group chat, however, showing how many people have read a message and how many have not is somewhat harder. The implementation logic of group-chat read/unread is not covered here. If you are interested, read 《How to Implement Read Receipts for Group Chat Messages in IM?》.
Back to the topic of this section, "read synchronization": the difficulty goes up another level, because read receipts are no longer just per account. They now have to be broken down to the case of the same account logged in on different devices, so the synchronization logic for read receipts becomes somewhat complicated.
Here I offer some ideas based on practical experience from my own IM architecture.
Specifically: a user may have multiple devices logged in to the same account (for example, Web/PC and mobile at the same time). In this case the read/unread feature needs read synchronization. Otherwise a message read on device 1 still shows as unread on device 2, which hurts the user experience from a product standpoint.
In my IM architecture, read synchronization relies mainly on two mechanisms:
- 1) Sync state maintenance: for each of a user's sessions, maintain a timestamp that stores the last read time;
- 2) If the user opens a session and has multiple devices online, send a PIMSyncRead message to notify the other devices.
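The two mechanisms above can be sketched together. This is a minimal illustration under assumptions: `ReadSyncService` and its methods are hypothetical names, the `notify` callback stands in for pushing a PIMSyncRead message, and state lives in memory only.

```python
class ReadSyncService:
    """Tracks the last-read timestamp per (user, session) and fans it out."""

    def __init__(self, notify):
        # notify(device_id, session_id, read_ts) stands in for PIMSyncRead
        self._notify = notify
        self._last_read = {}  # (user, session) -> last read timestamp
        self._online = {}     # user -> set of online device IDs

    def device_online(self, user, device):
        self._online.setdefault(user, set()).add(device)

    def device_offline(self, user, device):
        self._online.get(user, set()).discard(device)

    def mark_read(self, user, session, device, read_ts):
        key = (user, session)
        if read_ts <= self._last_read.get(key, 0):
            return False  # stale update: nothing newer was read
        self._last_read[key] = read_ts
        # Notify every other online device of the same account.
        for other in sorted(self._online.get(user, ())):
            if other != device:
                self._notify(other, session, read_ts)
        return True

    def is_read(self, user, session, msg_ts):
        # A message counts as read if it is not newer than the read mark.
        return msg_ts <= self._last_read.get((user, session), 0)
```

Keeping a single timestamp per session means a device that was offline only needs to fetch one value per session when it reconnects, instead of one flag per message.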
6、 Data security issues
6.1 Basics
Data security in an IM system architecture is more complex than in a typical system. On the communication side, it involves securing both socket long-connection communication and HTTP short connections. And with IM now widespread on mobile, trade-offs must also be made among security, performance, data traffic, and user experience. Building a complete IM security architecture therefore faces many challenges.
In an IM system architecture, data security mainly means communication security and content security.
6.2 Communication security
To talk about communication security, we first need to understand which services IM communication consists of.
At present, a typical IM system consists mainly of two kinds of communication services:
- 1) Socket long-connection service: technically, this is the network communication most people are familiar with, which in more detail means the TCP and UDP protocol part;
- 2) HTTP short-connection service: the commonly used HTTP REST interfaces.
For how to improve long-connection security, read 《Easy to Understand: Mastering the Security Principles of Message Transmission in Instant Messaging》. In addition, the WeChat team's 《WeChat's New-Generation Communication Security Solution: A Detailed Look at MMTLS, Based on TLS 1.3》 is also well worth reading.
If you need a higher level of communication security, see 《Instant Messaging Security (2): On the Application of Combined Encryption Algorithms in IM》. The idea of combining encryption algorithms in that article is very good.
As for short-connection security, everyone is familiar with it: enabling HTTPS is enough most of the time. If you do not know HTTPS well, start with these articles: 《Understand HTTPS Security Principles, Digital Certificates, One-Way Authentication, and Two-Way Authentication in One Article》 and 《Instant Messaging Security (7): If You Understand HTTPS This Way, One Article Is Enough》.
6.3 Content security
This may not be easy to understand: now that communication security is in place, why still worry about "content security"?
Let's look at the three functions of cryptography: encryption (Encryption), authentication (Authentication), and identification (Identification).
In detail:
- Encryption: prevents bad actors from accessing your data.
- Authentication: prevents bad actors from modifying your data without your noticing.
- Identification: prevents bad actors from impersonating you.
For the communication security described in the previous section, if a malicious attacker bypasses or breaks the "authentication" and "identification" of the communication link, then the "encryption" that depends on them may in fact be broken as well.
To address this, the content itself needs to be encrypted more securely and independently. This is what is called end-to-end encryption (E2E).
For example, Telegram, the IM that claims to be unbreakable, actually uses end-to-end encryption.
I won't go into end-to-end encryption here. If you are interested, these two articles are worth reading in depth:
- 《A Powerful Tool for Secure Mobile Communication: End-to-End Encryption (E2EE) Technical Details》
- 《How End-to-End Encryption (E2EE) Works in Real-Time Audio and Video Chat》
7、 The avalanche problem
A distributed IM architecture has an avalanche problem.
As we know, for high availability, a distributed IM architecture assigns each user login to a server according to a load-balancing algorithm. And here is the problem.
For example: suppose there are 5 data centers and data center A fails. All users previously served by A move to data center B. B is overwhelmed and collapses, and the combined users of A and B then move to data center C. This chain reaction eventually brings all services down.
Preventing the avalanche effect requires matching measures in both the server architecture and the client connection strategy. On the server side, rate-limiting capability is the foundation: mainly limiting the total number of users served and the number of users connecting within a short period.
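The two server-side limits just mentioned (total users served, and new connections within a short period) can be sketched as a small admission check. This is an illustration under assumptions: `AdmissionControl`, the fixed-window counting, and the parameter names are mine, and a production server would also need thread safety.

```python
class AdmissionControl:
    """Rejects new connections once either limit is reached."""

    def __init__(self, max_users, max_new_per_window, window_s=1.0):
        self.max_users = max_users                    # cap on total connected users
        self.max_new_per_window = max_new_per_window  # cap on new connections per window
        self.window_s = window_s
        self.users = 0
        self._window_start = float("-inf")
        self._new_in_window = 0

    def try_admit(self, now):
        # Start a fresh counting window once the current one has elapsed.
        if now - self._window_start >= self.window_s:
            self._window_start = now
            self._new_in_window = 0
        if self.users >= self.max_users:
            return False  # server is full
        if self._new_in_window >= self.max_new_per_window:
            return False  # too many new connections in this window
        self.users += 1
        self._new_in_window += 1
        return True

    def release(self):
        # Call when a user disconnects.
        self.users = max(0, self.users - 1)
```

A rejected client would be expected to back off and ask for another server rather than retry immediately, which is where the client-side strategy comes in.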
On the client side, there should be a strategy for what to do after detecting a disconnect, to prevent a large number of users from connecting to the same server at the same time.
There are usually two schemes:
- 1) Backoff: set a random interval between reconnection attempts;
- 2) LBS: when reconnecting, the client requests a new server IP from the LBS service, which reduces the number of users assigned to the same server within a short time.
These two schemes do not conflict and can be used together.
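Both schemes can be sketched in a few lines. The text only asks for a random interval; the exponentially growing ceiling below ("full jitter" backoff) is my addition, as are the function names and default values. I also read LBS here as a server allocation service, which is an assumption.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Random delay in seconds before reconnection attempt number `attempt`.

    The upper bound doubles with each attempt (starting from `base`) and
    is capped at `cap`; the actual delay is drawn uniformly below it.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


def pick_server(recent_allocations):
    """LBS side: hand out the server with the fewest recent allocations.

    `recent_allocations` maps server address -> number of users assigned
    to it recently; the chosen server's counter is incremented.
    """
    server = min(recent_allocations, key=recent_allocations.get)
    recent_allocations[server] += 1
    return server
```

The random delay spreads reconnecting clients over time, and the allocation step spreads them over servers, so the two reduce the reconnect spike along different axes.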
8、 Weak network problem
8.1 The reason for the weak network problem
With IM now widespread on mobile, weak networks are a routine problem. In elevators, on trains, in cars, on the subway, and so on, you run into obvious weak-network problems.
So why do weak-network problems occur?
To answer that question, we need to look at the principles of wireless communication.
The quality of wireless communication is affected by many factors, such as rapidly changing signal strength, signal interference, uneven distribution of base stations, and moving too fast. Explaining it all would take three days and three nights.
Interested readers should read the following articles carefully. Cross-disciplinary articles like these are rare:
- 《Zero-Basics Communication Technology Primer for IM Developers (11): Why Is the WiFi Signal Bad? One Article to Understand!》
- 《Zero-Basics Communication Technology Primer for IM Developers (12): Laggy Internet? Dropped Connections? One Article to Understand!》
- 《Zero-Basics Communication Technology Primer for IM Developers (13): Why Is the Mobile Phone Signal Poor? One Article to Understand!》
- 《Zero-Basics Communication Technology Primer for IM Developers (14): How Hard Is Wireless Internet Access on High-Speed Rail? One Article to Understand!》
Weak-network handling is a required course for mobile apps. The following summaries are also worth studying:
- 《Must-Read for Mobile IM Developers (1): Understanding "Weak" and "Slow" Mobile Networks in Plain Language》
- 《Must-Read for Mobile IM Developers (2): The Most Complete Summary of Mobile Weak-Network Optimization Methods》
- 《Summary of Short-Connection Optimization Techniques for Modern Mobile Networks: Request Speed, Weak-Network Adaptation, Security》
- 《Baidu APP Mobile Network Deep Optimization in Practice (3): Mobile Weak-Network Optimization》
8.2 IM Dealing with weak network problems
For IM, the weak-network problem is not very complicated. The core is message resending, ordering, and retrying at the receiving end.
IM problems caused by weak networks can usually be mitigated by the following means:
- 1) Automatic message resend;
- 2) Offline message reception;
- 3) Ordering of resent messages;
- 4) Offline instruction handling.
We discuss them one by one.
8.3 Message auto resend
On the subway, I often find that after the train starts moving, the network drops and message sending fails.
At this point the product can behave in two ways:
- a、 Tell the user right away that sending failed;
- b、 Keep sending: retry automatically 3-5 times (over about 3 minutes), and only then tell the user that sending failed.
Clearly, telling the user about the failure only after the automatic retries have failed is a much better experience. Especially when the network interruption is brief, the retry success rate is very high, and the user quite possibly never notices that a send failed at all.
Technically: the client IMSDK tracks the state of each message. You cannot simply call the network layer to send a message. There needs to be a state machine that manages several states: initial, sending, send failed, and send timed out. For the failed and timed-out states, the retry mechanism kicks in.
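The per-message state machine can be sketched as follows. The four states named in the text are kept; `SENT` and `GAVE_UP` are states I added so that the machine has terminal outcomes, and the class names and retry limit are hypothetical.

```python
from enum import Enum


class SendState(Enum):
    INIT = "init"
    SENDING = "sending"
    FAILED = "failed"
    TIMEOUT = "timeout"
    SENT = "sent"        # assumption: terminal success state
    GAVE_UP = "gave_up"  # assumption: retries exhausted, tell the user


class OutgoingMessage:
    def __init__(self, text, max_retries=3):
        self.text = text
        self.max_retries = max_retries
        self.retries = 0
        self.state = SendState.INIT
        self.history = [SendState.INIT]

    def _move(self, state):
        self.state = state
        self.history.append(state)

    def start(self):
        if self.state is not SendState.INIT:
            raise ValueError("message already started")
        self._move(SendState.SENDING)

    def on_result(self, outcome):
        """Report the outcome of one attempt: SENT, FAILED, or TIMEOUT."""
        if self.state is not SendState.SENDING:
            raise ValueError("no attempt in flight")
        if outcome is SendState.SENT:
            self._move(SendState.SENT)
            return
        if outcome not in (SendState.FAILED, SendState.TIMEOUT):
            raise ValueError("unexpected outcome")
        self._move(outcome)
        if self.retries < self.max_retries:
            self.retries += 1
            self._move(SendState.SENDING)  # automatic retry
        else:
            self._move(SendState.GAVE_UP)  # only now show "send failed"
```

The UI would show a failure marker only in `GAVE_UP`, so a send that succeeds on a retry never appears as failed to the user.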
There is also a discussion thread on designing retry mechanisms, if you are interested: 《How to Design a "Retry on Failure" Mechanism for a Fully Self-Developed IM?》.
The message timeout and retransmission mechanism described in 《IM Message Delivery Guarantee Mechanisms (1): Ensuring Reliable Delivery of Online Real-Time Messages》 is also a useful reference.
8.4 Offline message reception
Modern IM products generally no longer display an "online" status, so there is no need to present this to the user. On the technical level, though, it is still necessary to correctly detect when the user has gone offline.
There are several ways to detect this:
- a、 Signaling long-connection state: if no heartbeat response is received from the server for a long time, the connection is considered lost;
- b、 Number of failed network requests: if multiple network requests fail, the client is "probably" offline;
- c、 Device network state detection: check the network interface state directly. Android, iOS, Windows, and Mac generally all provide system APIs for this.
After correctly detecting the network state, when the network switches from "disconnected" to "restored", the client should actively pull the messages from the offline period (from the server's offline message list), so that no messages are lost under a weak network.
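The three signals and the pull-on-recovery rule can be combined in one small monitor. This is a sketch under assumptions: `ConnectivityMonitor`, its thresholds, and the `pull_offline` callback are hypothetical, and the platform network API is represented only by a method the app would call from its own system listener.

```python
class ConnectivityMonitor:
    """Combines the offline signals and pulls offline messages on recovery."""

    def __init__(self, pull_offline, heartbeat_timeout_s=30.0, max_failures=3):
        self._pull_offline = pull_offline  # fetches the server's offline list
        self._timeout = heartbeat_timeout_s
        self._max_failures = max_failures
        self._last_heartbeat = 0.0
        self._failures = 0
        self.online = True

    def _set_online(self, online):
        was_online = self.online
        self.online = online
        if online and not was_online:
            self._pull_offline()  # disconnected -> restored transition

    def on_heartbeat(self, now):
        # (a) A heartbeat response proves the long connection is alive.
        self._last_heartbeat = now
        self._failures = 0
        self._set_online(True)

    def on_tick(self, now):
        # (a) No heartbeat response for too long: treat as offline.
        if now - self._last_heartbeat > self._timeout:
            self._set_online(False)

    def on_request_failed(self):
        # (b) Several failed requests in a row: probably offline.
        self._failures += 1
        if self._failures >= self._max_failures:
            self._set_online(False)

    def on_network_lost(self):
        # (c) The platform reported that the network interface went down.
        self._set_online(False)
```

In this sketch only a heartbeat response marks the client as online again, since an interface coming back up does not by itself prove that the long connection works.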
The network-state detection mentioned above involves connection checking and keep-alive mechanisms, which are a real headache in IM.
We have accidentally stepped into the IM network keep-alive pit again. I won't expand on it here. If you are interested, be sure to read the following articles:
- 《Why Does Mobile IM Based on the TCP Protocol Still Need a Heartbeat Mechanism?》
- 《Understanding the Network Heartbeat Packet Mechanism in Instant Messaging Applications: Purpose, Principles, Implementation Ideas, and More》
- 《WeChat Team Original Sharing: Android WeChat Background Keep-Alive in Practice (Network Keep-Alive)》
- 《Mobile IM Practice: Implementing an Intelligent Heartbeat Mechanism for Android WeChat》
- 《Mobile IM Practice: Analysis of the Heartbeat Strategies of WhatsApp, Line, and WeChat》
8.5 Resend message sort
Another pitfall in weak-network logic is message ordering.
Suppose there are three messages A, B, and C. A and C are sent successfully, while B hits a brief network interruption during sending, which triggers an automatic retry.
Should the receiver then see A B C or A C B? I have seen different IM products handle this differently. If you are interested, try it out yourself.
The solution relies on the interpolation-style sorting mentioned in the service architecture discussion: messages from the same person are sorted by the local time attached to the message, and messages from different people are sorted by server time.
I won't repeat the details here. See Section 4 of this article, "4、 The ordering of messages".
8.6 Offline instruction processing
Some instructions may be issued while the network is down. After the network is restored, they should be synchronized to the server automatically.
For example, you can put your phone in airplane mode, delete a contact in WeChat, and see whether the deletion works. Then turn the network back on and check whether the change is synchronized to the server.
The same logic applies to read synchronization and similar scenarios: information produced offline must be synchronized to the server correctly.
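Offline instruction handling can be sketched as a local queue that is replayed after reconnection. This is an illustration only: `OfflineCommandQueue`, the `send` callback, and the command strings are hypothetical, and a real client would persist the queue so it survives an app restart.

```python
from collections import deque


class OfflineCommandQueue:
    """Queues instructions issued offline and replays them in order."""

    def __init__(self, send):
        # send(command) -> bool, True once the server accepted the command
        self._send = send
        self._queue = deque()

    def submit(self, command, online):
        # Send directly only if nothing older is still waiting; otherwise
        # queue the command so the original order is preserved.
        if online and not self._queue and self._send(command):
            return
        self._queue.append(command)

    def flush(self):
        """Call after the network is restored. Returns how many were sent."""
        sent = 0
        while self._queue:
            if not self._send(self._queue[0]):
                break  # still failing: keep the rest for the next flush
            self._queue.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._queue)
```

Replaying in the original order matters when later instructions depend on earlier ones, for example deleting a contact and then marking a session as read.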
8.7 To sum up
Weak-network handling in IM is actually relatively simple: automatic retries plus message state tracking solve most of the problems.
The details are not complicated either, mainly because IM message volume is relatively small, so things recover quickly once the network comes back.
The weak-network logic for video conferencing is much more complicated, especially keeping audio and video as smooth as possible in a weak network with high packet loss.
9、 Summary
This concludes the two articles of 《IM Architecture Essentials for 100 Million Users》. The first part, on IM architecture, went fine. This second part inadvertently dragged out all sorts of "pits" among the hot topics of IM. IM development really is hard to sum up in a few words...
For friends just getting started with IM development who want to learn mobile IM development systematically, I suggest reading the "from getting started to giving up" guide to IM development that I compiled (ha ha), namely 《One Article Is Enough for Beginners: Developing Mobile IM from Scratch》. I'll stop here, otherwise this article will run away from me again... (This article is published simultaneously at: http://www.52im.net/thread-3445-1-1.html)
10、 Reference material
[1] Massive concurrency IM Service Architecture Design