How did Tencent's top engineers move the whole company to the cloud?

2022-06-23 13:18:00 Programmer Xiaohui

What is the cloud all about?

This is the Internet age. Cloud services have changed our lives, and they have changed the whole IT industry. What exactly is a cloud service? Xiaohui offered an analogy in an earlier article:

There are 100 families in a village, and each needs to build a house. If every family gathers its own timber and bricks, lays its own foundation, raises its own beams, and builds its own walls and roof, that is traditional in-house development. If everyone instead hires the village's professional carpenters, bricklayers, and painters and directs them to do the groundwork, that is using cloud services.

Cloud services are growing rapidly in China, and many excellent cloud platforms have emerged in recent years. A friend of Xiaohui's happens to be part of Tencent's cloud effort. I talked with him a few days ago and heard many impressive stories.

Although Tencent Cloud has many customers in China, Tencent itself faced some legacy problems. Each business line tended to reinvent the wheel in its early days and came to depend on a variety of underlying frameworks and interfaces. Over time, this disconnected its technology from the mainstream technology stack, and it also hurt development efficiency.

To solve this problem, Tencent launched its internal strategy of moving self-developed businesses to the cloud in 2018. Recently, Tencent announced that its massive internal self-developed businesses have been moved to the cloud, which is also the largest cloud-native practice in China.

For frontline programmers, why go to the cloud? Is going to the cloud a “political task” handed down by the company, or something programmers actually want? Xiaohui talked it over with a friend.

The original architecture works fine, so why go to the cloud?

Xiaohui's friend is Ma Tongxing, technical director of Tencent Photon's Happy Games studio.

As a nationally popular name in casual games, “Happy Games” has a huge user base: in 2019 it had tens of millions of active users and more than one million players online at the same time, and today it still has tens of millions of daily active users.

Anyone in the business knows that such an evergreen product must have a stable technical architecture and mature operations behind it.

So the question is: the previous architecture worked well, so why go to the cloud?

Ma Tongxing's answer was brief: because the cloud is right there. As I understand it, he sees the cloud as inevitable for the industry, a general trend that no technology practitioner can ignore.

If Xiaohui had to describe Ma Tongxing from his own impressions, a few keywords would cover it: technical expert, talkative, fast learner, willing to share, embraces open source... As technical director of Tencent Photon's Happy Games, he also initiated and led the project to move the Happy series of games to the cloud.

Notably, as early as the beginning of 2019, Ma Tongxing made a decision: rebuild the original technical architecture of Happy Games on open source solutions, and migrate the whole business to the cloud.

In conversation with Ma Tongxing, his love and passion for cloud technology was obvious. Rather than reworking the company's existing or self-developed services, what pushed him to this decision was the release of version 1.1 of Istio, the open source service mesh that was getting more and more attention in the community at the time. Switching from a mature architecture to an open source project that had only recently been released, with no record of large-scale deployment, left many people on the team uneasy. “The risk is too great”, “we can't control it”, “the business pressure is too great”: concerns like these abounded.

But Ma Tongxing saw it differently. He felt compelled to do the hard and right thing. Xiaohui asked him to tell the whole story of how he and his team went to the cloud, in the hope that it offers some inspiration:

“If we don't do it, in three years we will be far behind”

In fact, we had been talking about refactoring for two or three years before the move to the cloud.

Around the second half of 2018, we began preliminary research on service discovery and traffic management, surveying the community's solutions and focusing on Istio. In early 2019, in February before the Spring Festival, we started technical validation. Istio 1.1 was released around that time. It was the first enterprise-ready version, there were few large-scale deployments in the industry, and Tencent had no such service internally.

Our team read technical material together and began to study it. At first I thought the microservice architecture and service mesh governance were “old wine in new bottles”: I assumed that running many processes and making each function smaller was what “microservices” meant. Later I found that was not it at all. Microservice governance here means the ability to schedule and govern the system as a whole.

How should we understand this governance capability? Here is an example. Say you have 100 services doing different things. Service A needs to call 30 of them, service B calls another 20, service C calls another 15, and one of those 15 calls back into service A... Think of it as 100 people working together: the interactions form a web, and pulling one thread moves the whole body.

With real microservice governance, no matter how fine-grained your services are or how complex the structure is, you do not need to pay extra attention to the details of scheduling, disaster recovery, fault tolerance, and scaling for those services. Governance does not get harder as the system grows, and it comes at no significant additional cost.

In short, the idea is to rebuild the service governance capabilities and sink them into the infrastructure layer, so that traffic management and scheduling do not intrude on the business architecture. In my view this design is more advanced than the traditional distributed architectures of the past.
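
As a rough illustration of what “sinking governance into the infrastructure layer” can look like, here is a minimal Python sketch. It is not the team's code and not how Istio is implemented; `Sidecar`, `broken_instance`, `healthy_instance`, and `business_logic` are hypothetical names invented for this example. The point is only that retry and failover live in a proxy object, while the business function contains none of that logic.

```python
from typing import Callable, List


class Sidecar:
    """Toy stand-in for a mesh proxy: it owns endpoint choice and retries."""

    def __init__(self, endpoints: List[Callable[[str], str]], max_attempts: int = 3):
        self.endpoints = endpoints
        self.max_attempts = max_attempts
        self.attempts_made = 0

    def call(self, request: str) -> str:
        last_error = None
        for attempt in range(self.max_attempts):
            # Rotate through endpoints so a failed instance is skipped on retry.
            endpoint = self.endpoints[attempt % len(self.endpoints)]
            self.attempts_made += 1
            try:
                return endpoint(request)
            except ConnectionError as exc:
                last_error = exc
        raise RuntimeError("all attempts failed") from last_error


def broken_instance(request: str) -> str:
    # Hypothetical service instance that is currently down.
    raise ConnectionError("instance down")


def healthy_instance(request: str) -> str:
    # Hypothetical service instance that answers normally.
    return f"matched:{request}"


def business_logic(proxy: Sidecar, player_id: str) -> str:
    # Business code only states what it wants; no retry or failover code here.
    return proxy.call(player_id)


proxy = Sidecar([broken_instance, healthy_instance])
result = business_logic(proxy, "player-42")
print(result, "after", proxy.attempts_made, "attempts")
```

In a real mesh the proxy runs as a separate process next to the service, so the same separation holds regardless of the language the business code is written in.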

The microservice governance architecture represented by this open source solution is what I meant by doing “the right thing”.

Our primary goal early on was not to save costs. From the start, we went to the cloud with the goal of improving our traffic management capability.

We reasoned from the major technology trends and opportunities. At the time we wanted automatic service discovery for some business modules, and there were other low-cost ways to get it. But compared with the architectural capabilities of the service mesh, and of the entire cloud-native community, those would have fallen far short.

For example, when a fault occurs in a system with large fan-out, you used to have to analyze logs to see what went wrong. In a cloud-native setup the whole call chain is clear: the call-chain topology is generated for you automatically, and you quickly see the root node of the problem. That is a qualitative improvement in the efficiency of development and operations.
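
To make the idea of an automatically generated call topology concrete, here is a small Python sketch under stated assumptions. The span records, their field names (`id`, `parent`, `service`, `error`), and the service names are all hypothetical; real tracing systems use richer formats. The sketch builds a caller-to-callee map from one request's spans and picks out the deepest failing service, which is one simple reading of “the root node of the problem”.

```python
from collections import defaultdict

# Hypothetical span records: one entry per service call in a single request.
spans = [
    {"id": "1", "parent": None, "service": "gateway", "error": True},
    {"id": "2", "parent": "1", "service": "matchmaking", "error": True},
    {"id": "3", "parent": "1", "service": "profile", "error": False},
    {"id": "4", "parent": "2", "service": "room", "error": True},
    {"id": "5", "parent": "2", "service": "ranking", "error": False},
]


def build_topology(spans):
    """Return a caller -> sorted list of callees map derived from parent links."""
    by_id = {s["id"]: s for s in spans}
    edges = defaultdict(set)
    for s in spans:
        if s["parent"] is not None:
            edges[by_id[s["parent"]]["service"]].add(s["service"])
    return {caller: sorted(callees) for caller, callees in edges.items()}


def find_root_causes(spans):
    """Return failing services none of whose own downstream calls failed."""
    children = defaultdict(list)
    for s in spans:
        children[s["parent"]].append(s)
    return sorted(
        s["service"]
        for s in spans
        if s["error"] and not any(c["error"] for c in children[s["id"]])
    )


topology = build_topology(spans)
root_causes = find_root_causes(spans)
print(topology)
print(root_causes)
```

Here the gateway and matchmaking calls fail only because the room call beneath them failed, so the sketch reports the room service rather than every service that returned an error.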

Personally, I think this is definitely a major direction, and it is even important for improving the profit margin of the cloud business. These capabilities are a big opportunity and trend in the industry. If we did not do this, things would still run well today, but three years from now we might be left far behind.

“Command and understanding of a technology is a powerful weapon against fear and worry”

When we started on this, many colleagues on the team were hesitant.

The head of our studio takes technological innovation very seriously and spares no effort on it. The studio has a shared technology team responsible for preliminary research, and each business project team also has its own engineers. To make an architectural change this large, the boss's support is not enough; all the core engineers need to want it wholeheartedly before it can move forward. Engineers on the project teams think more about project iteration and version stability, and they are subconsciously anxious: if we switch to a technical solution beyond my experience, will there be a lot of problems? Isn't taking small, quick steps on the original architecture also fine?

I spent a lot of time then talking with the core engineers of the project teams: why we should build on open source, how doing so benefits both the team and the individual, and how the risks and challenges of the architecture change can be borne by the organization rather than by individuals.

Some colleagues were very keen on the technology and responded positively. For colleagues who put the business first and still hesitated, I proposed starting with small-scale projects. That is how the team's confidence was first built.

The second thing was to unify the team's goal. We did many rounds of sharing and research, and even translated a book on K8S. In the end we agreed on the goal: refactor on a complete open source technology stack, from the protocol layer at the bottom, to the open source remote procedure call framework gRPC, to the service mesh and K8S service orchestration. The reason is simple: this stack has gradually become the industry's de facto standard. If we don't follow it we will fall behind, and we cannot beat everyone else on our own.

Around June 2019, the results of the pilot projects had been evaluated: the approach was feasible. By then the team's mood was completely different from the beginning. People were confident, because we had thoroughly understood the technology. Command and understanding of a technology is the most powerful weapon for dispelling fear and worry. By the time of the later large-scale cloud refactoring, everyone knew it well.

A “Great Shift of Heaven and Earth” for half a million people

Technical validation was the first milestone. At the second, what we needed to do was move the business smoothly into the cloud-native environment.

But going to the cloud does not just mean moving our services onto the cloud for self-developed businesses. If we merely moved the IDC production environment onto Tencent Cloud but kept the original architecture, I think the move would mean much less.

The company's strong push to move self-developed businesses to the cloud created a very good atmosphere for technical change. Around June 2019, the TKE team also started working on the mesh and set up a joint support team with us. On that basis, we began refactoring the original services one by one and moving them into the cloud-native environment.

The refactoring itself was not that hard technically; smooth service migration was the problem. At the time, part of our business ran on the cloud-native architecture and the rest ran off the cloud. Keeping the migration smooth and invisible to users on this heterogeneous architecture was very challenging for the business.

Games are different from ordinary Internet services. If a purchase fails, you just try again. But a game is interactive in real time: by the time a disconnected player comes back, the match has moved on, and the other players will not replay the part that was missed. It is like having to move hundreds of thousands of people from one playground to another without any of them noticing.

In fact, the business side had doubts during the move to the cloud. During the 2021 Spring Festival, we ran an operations campaign together with Mobile QQ. A flood of users poured in, the system was overloaded, and large numbers of users had to queue.

The operations colleagues had not told the R&D team about the campaign in advance, and part of our business was still off the cloud without dynamic elastic computing, so the failure happened. When we reviewed the causes with them afterwards, they did not quite understand: our system can support one or two million users online, so why couldn't it handle half a million arriving?

I explained with an example. An office building may hold 5,000 people, yet the elevator queue is very long every morning. Our situation was the same. One thing is capacity, the other is concurrent login load. Capacity can be very large without meaning that we can absorb a big burst in a short time. This part of the business had no dynamic elastic computing, hence the result. We are now doing a large technical refactoring, the cloud-native refactoring, and one of its core capabilities solves exactly this problem: if only one person shows up, the elevator becomes very small and saves resources; if 50,000 people suddenly arrive, it automatically becomes very wide and lets them through quickly.
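
The “elevator that widens and narrows” can be sketched with the scaling rule that the Kubernetes Horizontal Pod Autoscaler documents: desired replicas = ceil(current replicas × current metric / target metric). The interview does not say which autoscaling mechanism the team used, so treat this Python sketch as an illustration only; the function name, the logins-per-second metric, and all numbers are assumptions made up for the example.

```python
import math


def desired_replicas(current_replicas: int, current_load: float, target_load: float,
                     min_replicas: int = 1, max_replicas: int = 100) -> int:
    """Proportional scaling rule, clamped to a configured replica range.

    current_load and target_load are per-replica values of the same metric,
    for example login requests per second (a hypothetical choice here).
    """
    raw = math.ceil(current_replicas * current_load / target_load)
    return max(min_replicas, min(max_replicas, raw))


# Burst: each of 10 replicas sees 300 logins/s against a target of 100.
burst = desired_replicas(10, current_load=300, target_load=100)

# Quiet period: each of 10 replicas sees only 5 logins/s.
quiet = desired_replicas(10, current_load=5, target_load=100)

# Extreme burst: the result is capped at max_replicas.
capped = desired_replicas(10, current_load=5000, target_load=100)

print(burst, quiet, capped)
```

The burst case widens the elevator, the quiet case shrinks it to the minimum, and the cap shows why elastic capacity still has a configured upper bound.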

Once I put it this way, the operations colleagues understood, and later collaboration became much easier. Afterwards I concluded that a technical manager should give the team confidence in three ways:

First, do enough groundwork and validation to prove that the direction is sound;

Second, a major refactoring is inevitably full of technical risk. Once you choose to do it, you owe the team a commitment: as technical manager, you take responsibility for the risk first. If something goes wrong, do not simply blame the team, because doing nothing is always the least risky option;

Third, explain the technology in a way that non-technical people can understand, extend its influence to upstream and downstream teams, and win their trust and support so that the technology can move forward.

I remember that at 11 a.m. on November 17, 2020, we did our first full online upgrade of the cross-cluster mesh. After the upgrade, some cross-cluster routing information was lost, causing a large number of failures in an important game feature. The R&D engineers threw themselves into restoring the service, while the project team's planners and operations colleagues posted announcements, drew up a player compensation plan, and contacted customer service to reassure players. Everyone gave unconditional support, and no one complained. After R&D restored the service and completed the player data compensation, they worked with the TKE team to track down the root cause. Finally, at 17:44 that afternoon, the project team learned that the service had been restored and that the feature's entry point was back in the client. There was no blame or doubt. Our GM instead showed concern for everyone: “Many backend colleagues were busy handling the problem and haven't had lunch yet. Thank you for the hard work.”

The studio's expectations, tolerance, and patience for technological innovation are the main prerequisite; steadily operated projects are the setting in which refactoring and evolution can actually land; and more valuable still is the full cooperation and support of Happy Games colleagues from different professional backgrounds. These are what ensured that the cloud-native refactoring of Happy Games went smoothly.

As early as June 2020, all of Happy Games' matchmaking services had been deployed to the cloud-native environment, and matchmaking, a complex, strongly stateful, real-time interactive service, gained elastic computing capability for the first time. As far as we know, this was the first team in the games field to reach this level.

A higher goal: from 60% at peak to 70% around the clock

In fact, across our three stages (technical validation, migration of core services, and finally migration of the remaining legacy services), we held to one principle: don't rush, and don't chase core-hour counts. Look at just one thing: whether cloud-native elastic computing is being fully used. The criterion is reaching at least 60% load during busy periods.

Every game product has a life cycle. Off the cloud, everyone stacks up machines to carry the peak load, but after the peak the actual utilization of those machines is very low. We have done the calculation: many projects built without elastic computing have a peak load of under 30%, and the load is even lower off-peak.

If your business is on the cloud but its utilization is the same as off the cloud, I think that is a very poor result. We cannot treat going to the cloud as a passive task assigned by the company; merely meeting the basic cloud-migration metrics is far from enough.

So we emphasize the quality of the move, not its speed or the number of cores moved. We must rebuild the business systems around cloud-native capabilities so that, once they are on the cloud, they truly have elastic computing, service scheduling, and traffic management.

Once the old services have been refactored and moved to the cloud and the heterogeneous systems are gone, I believe overall resource utilization will reach a higher level. My expectation is 70%+ around the clock in the future: not only 70% utilization at peak but 70% off-peak as well, through peak shaving and valley filling, using idle compute for offline computing jobs. That is a long-term goal.

If we really achieve that, the core-hours we use could fall to one third of today's level or even lower. Of course, that is not easy.
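
A back-of-the-envelope check of the “one third” figure, as a hedged Python sketch. For a fixed amount of useful work, the core-hours needed scale inversely with average utilization. The interview gives only “under 30% at peak, lower off-peak” for the old setup, so the 20% round-the-clock average below is an assumption chosen for illustration, as is the function name.

```python
def core_hours_ratio(avg_util_before: float, avg_util_after: float) -> float:
    """Fraction of today's core-hours needed to do the same work.

    Useful work = core-hours x average utilization, so for constant work
    the core-hours needed are proportional to 1 / utilization.
    """
    if not (0 < avg_util_before <= 1 and 0 < avg_util_after <= 1):
        raise ValueError("utilization must be in (0, 1]")
    return avg_util_before / avg_util_after


assumed_before = 0.20   # assumption: round-the-clock average, below the <30% peak
target_after = 0.70     # the 70% around-the-clock goal stated above

ratio = core_hours_ratio(assumed_before, target_after)

# Average utilization today at which the saving is exactly one third.
break_even_before = target_after / 3

print(f"core-hours needed: {ratio:.1%} of today's")
print(f"one-third break-even utilization: {break_even_before:.1%}")
```

Under these assumptions the result lands slightly below one third, and any current average utilization under roughly 23% would give “one third or even lower”, which is consistent with the figures quoted in the interview.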

One thing moved me in particular. Last summer our project team organized a team-building trip to Wugong Mountain in Jiangxi Province. It took hours of climbing on foot, and in the end people really were standing above the clouds, with the clouds fifty or sixty meters, even a hundred meters, below their feet. Everyone was excited and started taking photos and posting them to their WeChat Moments.

“We really are a team on the cloud.” Many people wrote this in their Moments. At that moment I knew that we had done this “right thing” right.

(Photo from the Happy Games team-building trip)

People often say that a big ship is hard to turn. Given Tencent's huge size, the difficulty of going to the cloud is self-evident. Yet, as the story above shows, Tencent has many people like Ma Tongxing who boldly experiment with their own technical experience and ideas and finally complete one seemingly impossible task after another, giving many old projects inside Tencent a new life on the cloud.

Hearing these stories, Xiaohui is full of admiration. Most people in IT work on routine business tasks; getting to take part in a cloud-native transformation this large is both a challenge and an opportunity. As they learn and experiment, overcome difficulties together, and finally finish the move to the cloud, I believe each participant's abilities grow enormously.

These are Tencent's engineers, and this is Tencent's spirit. I hope that, supported by such a group of lovable engineers, Tencent Cloud goes ever further and ever higher.

Copyright notice
This article was written by [Programmer Xiaohui]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/174/202206231236018459.html