How did Tencent's top engineers move the whole business to the cloud?
2022-06-23 13:18:00 【Programmer Xiaohui】
What is the cloud all about?
In the Internet age, cloud services have changed our lives and the whole IT industry. What exactly is a cloud service? Xiaohui offered an analogy in an earlier article:
There are 100 families in a village, and every family needs to build a house. If each family prepares its own wood and bricks, lays the foundation, raises the beams, builds the walls and tiles the roof by itself, that is traditional in-house development. If everyone instead hires the village's professional carpenters, bricklayers and painters and directs them to do the basic work, that is the equivalent of using cloud service resources.
Cloud services are developing rapidly in China, and many excellent cloud platforms have emerged in recent years. A friend of Xiaohui's happens to be a member of the Tencent Cloud team; we talked a few days ago and I heard many impressive stories.
Although Tencent Cloud has a large number of customers in China, Tencent itself faces some legacy problems: in their early days, its business lines often built their own wheels and came to depend on a variety of in-house frameworks and interfaces. Over time this has, first, left their technology disconnected from the mainstream technology ecosystem and, second, hurt development efficiency.
To solve this problem, Tencent launched its internal strategy of moving self-developed business to the cloud in 2018. Recently, Tencent announced that its massive internal self-developed business has moved to the cloud, which is also the largest cloud-native practice in China.
For a frontline programmer, why go to the cloud? Is cloud migration a “political task” handed down by the company, or something programmers really want? Xiaohui chatted with a friend about it.
“The original architecture works well, so why go to the cloud?”
Xiaohui's friend is Ma Tongxing, technical director of Tencent Photon Happy Game Studio.
As a flagship of national-level casual gaming, the “Happy Game” series has a huge user base: in 2019 it had tens of millions of active users, with more than one million players online at the same time, and today it still has tens of millions of daily active users.
Anyone who has run a product knows that behind such an evergreen business there must be a stable technical architecture and mature operations.
So here comes the question: the previous architecture worked well, so why go to the cloud?
Ma Tongxing's answer was brief: “Because the cloud is right there.” As I understand it, in his view the cloud is inevitable for the industry. It is the general trend, and no technology practitioner can turn a blind eye to it.
If I were to describe Ma Tongxing from my own impressions, these keywords would roughly sum him up: technology enthusiast, talkative, a very fast learner, willing to share, embraces open source. As technical director of Tencent Photon Happy Game Studio, he also initiated and led the cloud migration of the Happy series of games.
It is worth mentioning that as early as the beginning of 2019, Ma Tongxing made a decision: rebuild the original technical architecture of the Happy games on open-source solutions, and migrate the whole business to the cloud.
When chatting with Ma Tongxing, his love and passion for cloud technology was obvious. What pushed him to this decision was not a company requirement to transform existing services or to move self-developed business to the cloud. It was the release of version 1.1 of Istio, the open-source service mesh that was drawing growing attention in the community at the time. Switching from a mature architecture to an open-source solution that had only just been released and had no large-scale deployments left many people on the team unsure. “The risk is too great”, “we can't control it”, “the business pressure is too high”: concerns like these abounded.

But Ma Tongxing did not see it that way. He felt compelled to do “the hard but right thing”. Xiaohui asked him to tell the whole story of how he and his team went to the cloud, in the hope that it gives you some inspiration:
“If we don't do it, in three years we will be far behind”
In fact, we had been talking about refactoring for two or three years before going to the cloud.
Around the second half of 2018, we started some preliminary research on service discovery and traffic management, surveying the community's solutions and focusing on Istio. By early 2019, before the Spring Festival in February, we started technical validation. Istio 1.1 had just been released; it was the first “Enterprise Ready” version, there were few large-scale deployments in the industry, and Tencent had no such service internally.
Our team read technical material together, and I started studying it myself. At first I thought microservice architecture and service mesh governance were “old wine in new bottles”. I thought that running lots of processes and making each function smaller was what “microservices” meant. Later I found that this was really not the case: microservice governance here refers to the ability to schedule and govern the system as a whole.
How should we understand this governance capability? Let me give an example. Say you have 100 services doing different things. Service A needs to access 30 of them, service B accesses another 20, service C accesses another 15, and one of those 15 calls back into service A, and so on. Think of it as 100 people working together: the interactions form a dense web, and pulling one thread moves the whole body.
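To make the “pull one thread” effect concrete, here is a minimal sketch in Python. The call graph, the service names and the `affected_by` helper are all hypothetical and invented for illustration; they are not the studio's real topology or tooling. The sketch walks the graph backwards from a failed service to find everything that depends on it, directly or indirectly.

```python
from collections import defaultdict, deque

# Hypothetical call graph: each service maps to the services it calls.
# The names and edges are illustrative only, not the studio's real topology.
CALLS = {
    "A": {"D", "E"},
    "B": {"E", "F"},
    "C": {"F", "G"},
    "G": {"A"},  # one of C's dependencies calls back into A
}


def affected_by(calls, failed):
    """Return every service that directly or indirectly calls `failed`."""
    # Invert the graph: for each service, record who calls it.
    callers = defaultdict(set)
    for caller, callees in calls.items():
        for callee in callees:
            callers[callee].add(caller)

    # Breadth-first walk over the callers of the failed service.
    affected, queue = set(), deque([failed])
    while queue:
        service = queue.popleft()
        for caller in callers[service]:
            if caller != failed and caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


print(sorted(affected_by(CALLS, "E")))
```

Even in this five-edge toy graph, a failure in `E` reaches `A`, `B`, `C` and `G`. With 100 services the same walk is impractical to do by hand, which is roughly the problem that infrastructure-level governance takes off the business code.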
With real microservice governance, no matter how fine-grained your services are or how complex the structure is, you do not need to pay extra attention to the details of scheduling, disaster recovery, fault tolerance and scaling for these services. Governance does not get harder as the system grows, and it comes at no significant additional cost.
In short, service governance capabilities are rebuilt and pushed down into the infrastructure layer, so that traffic management and scheduling do not intrude into the business architecture. In my view this design is more advanced than the traditional distributed architectures of the past.
The microservice governance architecture represented by this open-source solution is what I meant by “the right thing”.
Our primary goal in the early stage was not to save costs. From the beginning, we went to the cloud with the goal of improving our traffic management capability.
We reasoned from the big technology trends and opportunities. At the time we only wanted automatic service discovery for some business modules, and there were other low-cost ways to get that. But compared with the architectural capabilities of the service mesh, and of the cloud-native community as a whole, those approaches would have fallen a long way short.
For example, when a fault occurs in a system with large fan-out, in the past you had to analyze logs to find out what went wrong. With cloud native, the whole call chain is clear: the call-chain topology is generated for you automatically, and you quickly see the root node of the problem. That is a qualitative improvement in the efficiency of both development and operations.
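As a rough illustration of why a recorded call chain makes the root node easy to find, here is a small Python sketch. The span records and their field names (`id`, `parent`, `service`, `error`) are assumptions made up for this example, not the schema of any particular tracing system or of the team's tooling. The idea is that a failing span with no failing child is the deepest point of failure in the chain.

```python
# Hypothetical spans from one traced request. Field names are assumptions
# for this sketch, not the schema of any particular tracing system.
SPANS = [
    {"id": 1, "parent": None, "service": "gateway", "error": True},
    {"id": 2, "parent": 1, "service": "lobby", "error": False},
    {"id": 3, "parent": 1, "service": "match", "error": True},
    {"id": 4, "parent": 3, "service": "store", "error": True},
    {"id": 5, "parent": 3, "service": "rank", "error": False},
]


def root_causes(spans):
    """Return services whose span failed while none of their child spans failed."""
    failed = [s for s in spans if s["error"]]
    # Any span that is the parent of a failed span is probably just
    # propagating an error that started further down the chain.
    parents_of_failed = {s["parent"] for s in failed if s["parent"] is not None}
    return sorted(s["service"] for s in failed if s["id"] not in parents_of_failed)


print(root_causes(SPANS))
```

Here `gateway` and `match` both report errors, but only `store` fails without a failing child, so it is the place to look first. A real tracing backend does far more (timing, sampling, cross-request aggregation); this only shows the shape of the reasoning.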
Personally, I think this is clearly a major direction, and it even matters a great deal for improving the profit margin of the cloud business. These capabilities are big opportunities and trends in the industry. If we did not do this, things would still run fine today, but three years from now we might be far behind.
“Mastery and understanding of the technology is a powerful weapon against fear and worry”
When we started this, many colleagues on the team were hesitant.
The head of our studio attaches great importance to technological innovation and spares no effort on it. The studio has a shared technology team in charge of technical pre-research, and each business project group has its own technical team as well. To make such a big architectural change, support from the boss is not enough: all the core engineers have to want to do it from the bottom of their hearts, or it cannot go on. People in the project groups think more about project iteration and version stability, and they worry instinctively: if we switch to a technical solution beyond my experience, will there be a lot of problems? Wouldn't it be just as good to keep iterating in small, fast steps on the original architecture?
At that time I spent a lot of time talking with the core engineers of the various project groups: why we should build on open source, what the benefits would be for the growth of the team and of individuals, and how the risks and challenges of the architecture change could be borne by the organization rather than by individuals.
Some colleagues are very keen on technology, and their response was quite positive. Others who put the business first still hesitated, so I proposed that we start with small-scale projects. In this way the team's confidence was gradually built up.
The second thing was to unify the team's goal. We did many rounds of sharing and research, and even translated a book on K8S. In the end we agreed on the goal: refactor on a complete open-source technology stack, from the underlying protocol, to the open-source remote procedure call system gRPC, to the service mesh and K8S service orchestration. The reason was simple: this stack has gradually become the de facto industry standard. If we did not follow it we would fall behind, and we could not outrun everyone else on our own.

Around June 2019, the results of the pilot project were evaluated: the approach was feasible. By then the team's mood was completely different from the beginning. People had become confident, because we had thoroughly understood the technology. Mastery and understanding of the technology is the most powerful weapon against fear and worry. By the time we got to the large-scale cloud refactoring later on, everyone was already familiar with it.
Moving half a million people: “the great shift”
Technical validation was the first milestone. At the second milestone, what we needed to do was move the business smoothly to the cloud-native environment.
But going to the cloud does not simply mean moving our services onto the self-developed cloud. Suppose we had just moved the IDC production environment onto Tencent Cloud but kept the original architecture: in my opinion that would mean far less.
The company's strong push to move self-developed business to the cloud created a very good atmosphere for technological change. Around June 2019, the TKE team also started working on a mesh and set up a joint support team with us. On that basis, we started refactoring the original businesses one by one and moving them into the cloud-native environment.
The refactoring itself was not that hard technically, but migrating services smoothly was a problem. At that time part of our business ran on the cloud-native architecture and the other part was still off the cloud. Keeping the migration smooth and invisible to users on such a heterogeneous architecture was very challenging for the business.
Games are different from ordinary Internet services. If you are buying something and it fails, you just try again. But a game is interactive in real time: if you drop offline and come back, the match has already moved on, and the other players will not replay the part you missed. It is like having to move hundreds of thousands of people from one playground to another without any of them noticing.
In fact, there were doubts on the business side during the move to the cloud. During the 2021 Spring Festival, we ran an operations campaign together with Mobile QQ. A large number of users rushed in, the system was overloaded, and many users ended up waiting in a queue.
The operations colleagues had not told the R&D team about the campaign in advance, and part of our business was still off the cloud with no dynamic elastic computing capability, so this fault happened. Later, when we reviewed the causes with the operations colleagues, they did not quite understand: our system can support one or two million players online, so why couldn't it handle half a million people arriving?
I gave them an example to explain. An office building can hold 5,000 people, yet the elevator line is very long every morning. Our situation was the same. One thing is capacity, the other is the concurrent login load: capacity can be very large, but that does not mean we can take in a large crowd within a short time. That part of the business did not yet have dynamic elastic computing, so this is what happened. We are now doing a major technical refactoring, called the cloud-native refactoring, and one of its core capabilities is to solve exactly this problem: if only one person shows up, the elevator becomes very small to save resources; if 50,000 people suddenly arrive, it automatically becomes very wide and lets those 50,000 people through quickly.
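The elevator analogy can be sketched as a simple sizing rule. The Python below is an illustration only: the interview does not say which autoscaler or thresholds the team used, so `replicas_needed`, the login rates and the per-replica throughput are all hypothetical. It separates the two quantities in the story: how many players can stay online (the building) and how many logins per second the entrance can absorb (the elevator).

```python
def replicas_needed(logins_per_second, logins_per_replica, min_replicas=1):
    """Size the login tier for the current arrival rate, not for total capacity.

    All numbers passed in are hypothetical; a production autoscaler would also
    smooth the signal and apply cooldown windows before changing the size.
    """
    if logins_per_replica <= 0:
        raise ValueError("logins_per_replica must be positive")
    # Ceiling division with integers, so there is no floating-point rounding.
    needed = -(-logins_per_second // logins_per_replica)
    return max(min_replicas, needed)


# A quiet moment: a single replica (a "small elevator") is enough.
print(replicas_needed(logins_per_second=1, logins_per_replica=200))

# A sudden rush: the tier widens in proportion to the arrival rate.
print(replicas_needed(logins_per_second=50_000, logins_per_replica=200))
```

A fixed-size deployment has to pick one of these numbers in advance. Size it for the quiet case and the rush queues up; size it for the rush and the machines sit idle the rest of the time.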
After this explanation the operations colleagues understood, and later collaboration became much smoother. My conclusion from this was that a technical manager should give the team three kinds of confidence:
First, do enough basic validation to prove that the direction is reliable;
Second, a major refactoring is inevitably full of technical risk. Since you chose to do it, you must stand by the team, and as technical manager you must be the first to take responsibility for the risk. If something goes wrong, do not blindly blame the team; after all, doing nothing would have been the least risky choice;
Third, explain things in a way non-technical people can understand, extend the influence of technology to the upstream and downstream teams, and win their trust and support, so that the technology can keep moving forward.
I remember that at 11 a.m. on November 17, 2020, we did our first full online upgrade of the cross-cluster mesh. After the upgrade some cross-cluster routing information was lost, which caused large-scale failures in an important game feature. The R&D colleagues threw themselves into restoring the affected business, while the planning and operations colleagues in the project group published announcements, prepared a user compensation plan and contacted customer service to reassure players. Everyone gave their unconditional support and no one complained. After restoring service and completing the player data compensation, the R&D colleagues worked with the TKE team to track down the root cause. Finally, at 17:44 that afternoon, the backend colleagues told the project group that the service had been restored and the feature's entry point was back in the client. There was no blame and no doubt. Instead, our GM showed concern for everyone: “Many of the backend colleagues have been busy dealing with the problem and haven't had lunch yet. Thank you for your hard work.” The studio's expectations, tolerance and patience toward technological innovation are the major prerequisite; steadily operating projects are the setting in which the refactoring can actually be carried out; and what is even more valuable is the full cooperation and support of Happy Games colleagues from different professional backgrounds. These are the important guarantees that allowed the cloud-native refactoring of Happy Games to go smoothly.
As early as June 2020, all of the Happy Games match services had been deployed to the cloud-native environment, and these complex, strongly stateful, real-time interactive match services gained elastic computing capability for the first time. As far as we know, this was the first team in the game industry to reach this level.
A higher goal: from 60% at peak to 70% around the clock
In fact, across our three stages (technical validation, migrating the core services, and finally migrating the remaining legacy services) we kept to one big principle: do not rush, and do not compete on the number of core-hours moved. We look at only one thing: whether the cloud-native elastic computing capability is being fully used. The criterion is that load should reach at least 60% at busy times.
Every game product has a life cycle. Off the cloud, everyone piles up machines to carry the load at peak, but once the peak has passed, the actual utilization of those machines is very low. We have done the calculations: many projects built without elastic computing have a peak load of less than 30%, and an even lower load off-peak.
If your business is on the cloud but its utilization is the same as it was off the cloud, I think that is a very bad result. We cannot treat going to the cloud as a passive task assigned by the company; merely meeting the basic cloud migration targets is far from enough.
So we emphasize competing not on the speed of migration or on the number of cores, but on the quality of the migration. We must rebuild the business systems around cloud-native capabilities, so that once they are on the cloud they have real elastic computing, service scheduling and traffic management capabilities.
Once the old businesses have been refactored and moved to the cloud and the heterogeneous systems have been retired, I believe overall resource utilization will reach a higher level. My expectation is to reach 70% or more around the clock in the future: not just 70% utilization at peak hours, but 70% in off-peak hours as well, by shaving the peaks and filling the valleys and using idle computing resources for offline computing jobs. This is a long-term goal.
If we really achieve that, the number of core-hours we use may drop to one third of the current level or even lower. Of course, that is not an easy thing to do.
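The “one third” figure follows from simple arithmetic: for a fixed amount of useful work, the core-hours you must provision scale inversely with average utilization. The Python sketch below works one example. The 23% starting point is a hypothetical average chosen for illustration; the interview only says that peak load was below 30% and off-peak load lower still. The 70% target is the one stated above.

```python
from fractions import Fraction


def core_hours_ratio(util_before, util_after):
    """Core-hours needed after, as a fraction of before, for the same useful work.

    Provisioned core-hours = useful work / average utilization, so the ratio
    of the two provisioning levels is simply util_before / util_after.
    """
    if util_before <= 0 or util_after <= 0:
        raise ValueError("utilization must be positive")
    return Fraction(util_before) / Fraction(util_after)


BEFORE = Fraction(23, 100)  # hypothetical round-the-clock average off the cloud
AFTER = Fraction(70, 100)   # the round-the-clock target stated in the interview

ratio = core_hours_ratio(BEFORE, AFTER)
print(ratio, float(ratio))
```

With these inputs the ratio is 23/70, just under one third. Any starting average below about 23% pushes the result lower, which is consistent with the “or even lower” in the estimate.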
One thing moved me in particular. Last summer our project group organized a team-building trip to Wugong Mountain in Jiangxi Province. After hours of climbing on foot, people were finally standing above the clouds, with the clouds fifty, sixty, even a hundred meters below their feet. Everyone was very excited and started taking photos and sending them to friends.
“We really are a team ‘on the cloud’.” Many people wrote this in their WeChat Moments. At that moment I knew that this “right thing” of ours really had been done right.

(Photo: Happy Games team-building trip)
People often say that a big ship is hard to turn around. Given Tencent's sheer size, the difficulty of going to the cloud is self-evident. However, just as in the story above, there are many people like Ma Tongxing at Tencent who bravely try things out with their own technical experience and ideas and in the end complete seemingly impossible tasks. Many old projects inside Tencent have been rejuvenated on the cloud.
Hearing these stories, Xiaohui feels both envy and admiration. Most people in IT are engaged in routine business work; getting to be part of such a huge cloud-native transformation project is both a challenge and an opportunity. They learn and experiment, overcome difficulties together, and finally complete the move to the cloud, and I believe every participant's abilities grow greatly in the process.
These are the technical people of Tencent, and this is the spirit of Tencent. I hope that, supported by such a group of lovable engineers, Tencent Cloud will go farther and climb higher.