Evolution of Alibaba e-commerce architecture

2022-06-25 07:32:00 【Programmer base camp】

Preface: I came across a report online from the first Alibaba Middleware Technology Summit, held in July 2017, titled "Evolution of Alibaba e-commerce architecture". I thought it was good, so I went through it again and am recording it here. Most of it is copied from the report and reorganized; where the original was obscure, I have explained it in my own words. This is for my own review.

References:

https://yq.aliyun.com/articles/147755

https://developer.aliyun.com/article/161190

Contents

Chapter 1: Taobao 1.0

Chapter 2: Taobao 2.0

2.1 Problems with the 1.0 architecture

2.2 The 2.0 architecture

Chapter 3: Taobao 3.0

3.1 Problems with the 2.0 architecture

3.2 The architecture during the move from 2.0 to 3.0

3.3 Problems encountered during the move from 2.0 to 3.0

3.4 The 3.0 architecture

3.4.1 RPC framework

3.4.2 Database splitting

3.4.3 MQ cluster

3.4.4 Distributed tracing

3.4.5 Other systems

Chapter 4: Taobao 4.0

4.1 Problems with the 3.0 architecture

4.2 The 4.0 architecture

Chapter 5: A new starting point

Chapter 1: Taobao 1.0

Building Taobao and actually getting it online took a little over a month in total. What did we do in that month?

First, we did technology selection and decided how we would develop going forward. Second, we worked out how to get the website online within that month.

We bought an e-commerce site built on the LAMP stack, got its source code, and did secondary development on top of it: changes to the UI, changes to the top and bottom titles, and so on. The biggest change was adding read-write separation to its database.

Chapter 2: Taobao 2.0

2.1 Problems with the 1.0 architecture

As business volume grew, bottlenecks appeared, mainly in the database. The database at the time was MySQL 4, which was not stable enough and crashed often.

So we switched the database directly to Oracle and had PHP connect to Oracle directly. But PHP does not support connection pooling. Even with open-source PHP middleware connecting PHP to Oracle, things were still very unstable, and the connection-pool middleware often hung.

2.2 The 2.0 architecture

We began to consider moving the technology stack to Java, because Java has a relatively mature ecosystem for enterprise applications. The migration was bumpy: first, the system was running online; second, it was growing rapidly. So the best way to move to Java was to replace the system piece by piece. We also found that the write volume on Oracle was still fairly large, so at that time we built a search service and moved product search and shop search into it, so that not every request had to hit the database. That completed the evolution from the 1.0 architecture to the 2.0 architecture.

Chapter 3: Taobao 3.0

3.1 Problems with the 2.0 architecture

3.2 The architecture during the move from 2.0 to 3.0

So we started upgrading the architecture. First, we added an in-memory cache, mainly to relieve the excessive pressure on the database. We developed our own distributed key/value cache (TAIR) and placed it in front of the database, which eased that pressure. Second, we added a distributed file system (TFS). The file system we used before was a commercial product; the cost was too high and the number of servers was very large, so we developed our own.
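
To make the "cache in front of the database" idea concrete, here is a minimal sketch of the cache-aside pattern. It is not TAIR's API: the class name, the plain dictionaries standing in for the cache and the database, and the `db_reads` counter are all assumptions made for illustration.

```python
class CacheAsideStore:
    """Cache-aside reads in front of a slower backing store (illustrative only)."""

    def __init__(self, database):
        self.database = database      # stands in for the real database
        self.cache = {}               # stands in for a distributed key/value cache
        self.db_reads = 0             # counts how often the database is hit

    def get(self, key):
        if key in self.cache:         # cache hit: the database is not touched
            return self.cache[key]
        self.db_reads += 1            # cache miss: fall back to the database
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value   # populate the cache for later reads
        return value

    def put(self, key, value):
        self.database[key] = value    # write to the database first
        self.cache.pop(key, None)     # then drop the stale cache entry


store = CacheAsideStore({"item:1": "red shoes"})
first = store.get("item:1")    # miss, served by the database
second = store.get("item:1")   # hit, served by the cache
```

Repeated reads of a hot key reach the database only once, which is the pressure relief the report describes.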

3.3 Problems encountered during the move from 2.0 to 3.0

The technical team had grown to about 500 people, and maintenance was becoming more and more complex.
There was a single WAR application: the package kept growing and shipping business features got slower and slower. Data gradually formed isolated islands that could not be connected.
The system was built on a traditional application architecture: the business was exploding, the architecture was not flexible enough, and a single point of failure had a huge impact.
Then there was performance. As front-end business grew and the number of servers increased, Oracle's connection count also became a bottleneck. So we had to start on a new architecture and take the whole architecture a step further.

3.4 The 3.0 architecture

We moved toward the 3.0 architecture. The system was split into smaller pieces, mainly by layering it. Systems were divided into three categories:

The first is the C class, the center class, such as the member, product, and shop centers. Other systems, such as product details and trade orders, are built on top of these centers.

There are also public classes, the P class, such as the trading platform; this is a split by business. Well-known projects include the Qiandao Lake project (which split out the trade center and the category-attribute center) and the Five-Color Stone project (which split out the shop center, product center, and review center).

There is also the vertical team class.

As the technical architecture changed, our business structure began to change too, and corresponding teams were set up. The stages of our architecture can be summarized as follows. At the beginning it was all-in-one, with 1 to 10 people maintaining one project. The second stage was 10 to 1,000 people maintaining an MVC architecture, with front end and back end separated and each doing its own job. The third stage was RPC: each system was split out and the systems communicate with one another. The fourth stage is an SOA-style model.

I have thought about the classification above. I don't think the description is very clear; I believe it means roughly the following, though I can't guarantee this is right:

The C class focuses on providing one core service each, such as the member service, the product service, and the shop service.

The P class provides public services, such as payment: different businesses (Taobao, Fliggy) all deduct money from the bank through this kind of service.

The vertical team class provides composite services. For example, if there is an interface for reading product information that needs to call C-class services such as the shop service, a vertical team implements that interface.

3.4.1 RPC framework

We developed a lightweight framework, HSF, an RPC framework based on Java interfaces. It lets developers call other systems as if they were writing ordinary local Java code, while the framework actually makes the call remotely.
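
The "remote call that looks local" idea can be sketched with a small proxy object. This is not HSF and not Java: it is a Python illustration in which `RemoteProxy`, `ItemService`, and the in-process transport function are hypothetical names, and the "network" is just a function call that passes JSON strings.

```python
import json


class RemoteProxy:
    """Turns a method call into a serialized request sent over a transport."""

    def __init__(self, service_name, transport):
        self._service_name = service_name
        self._transport = transport   # callable: request string -> response string

    def __getattr__(self, method_name):
        # Only reached for names that are not real attributes, i.e. remote methods.
        def remote_call(*args):
            request = json.dumps(
                {"service": self._service_name, "method": method_name, "args": args}
            )
            response = self._transport(request)
            return json.loads(response)["result"]
        return remote_call


class ItemService:
    """Provider-side implementation (hypothetical example service)."""

    def get_title(self, item_id):
        return f"item-{item_id}"


def in_process_transport(request):
    """Stands in for the network: decodes the request and dispatches it."""
    call = json.loads(request)
    provider = {"ItemService": ItemService()}[call["service"]]
    result = getattr(provider, call["method"])(*call["args"])
    return json.dumps({"result": result})


item_service = RemoteProxy("ItemService", in_process_transport)
title = item_service.get_title(42)   # reads like a local call
```

The caller never builds a request by hand; swapping the transport for a real socket would not change the calling code.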

RPC usually comes with service discovery. How does service discovery work?

How does application A know how many machines application B has, and what their IPs are? The simplest way is a static list: record the IPs and use a polling strategy, calling machine 1 first and then machine 2. But that cannot provide dynamic discovery. So we built a dynamic configuration center (configserver). When your service goes online, it registers with configserver as a provider. When a consumer needs the service, it checks which providers can be called and takes that list. Configserver automatically pushes the providers' IPs to the consumer, so the consumer discovers the providers automatically and the two sides establish a calling relationship. If configserver goes down, your providers do not go down, and services that have already been pushed are not affected, because configserver has already pushed the corresponding list to the consumers. When a distributed system becomes huge, each part of the system has its own role.
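
The register, subscribe, and push flow described above can be sketched as follows. This is a toy model, not configserver's real interface: the class and method names, the `trade` service name, and the addresses are made up for illustration.

```python
class ConfigServer:
    """Toy registry: providers register addresses, consumers get pushed updates."""

    def __init__(self):
        self.providers = {}    # service name -> list of provider addresses
        self.subscribers = {}  # service name -> list of subscribed consumers

    def register(self, service, address):
        self.providers.setdefault(service, []).append(address)
        for consumer in self.subscribers.get(service, []):
            consumer.on_push(service, list(self.providers[service]))

    def subscribe(self, service, consumer):
        self.subscribers.setdefault(service, []).append(consumer)
        consumer.on_push(service, list(self.providers.get(service, [])))


class Consumer:
    """Keeps a local copy of the pushed address list and round-robins over it."""

    def __init__(self):
        self.addresses = {}  # service name -> last pushed address list
        self.counter = 0

    def on_push(self, service, addresses):
        self.addresses[service] = addresses

    def pick(self, service):
        pool = self.addresses[service]
        address = pool[self.counter % len(pool)]
        self.counter += 1
        return address


config_server = ConfigServer()
config_server.register("trade", "10.0.0.1:12200")   # made-up address
consumer = Consumer()
config_server.subscribe("trade", consumer)
config_server.register("trade", "10.0.0.2:12200")   # pushed to the consumer

config_server = None  # simulate the config server going down
picks = [consumer.pick("trade") for _ in range(3)]
```

Because the consumer holds its own copy of the pushed list, it keeps calling providers after the registry disappears; it simply stops learning about new ones.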

3.4.2 Database splitting

Oracle does have performance bottlenecks, and MySQL, after years of development, had become mature and stable. We decided to split the database, which is what moving off IOE (IBM, Oracle, EMC) refers to. Splitting on MySQL means dividing databases and tables according to certain rules. The first split is read-write separation, followed by vertical splitting and then horizontal splitting. Vertical splitting is mainly by business. Once vertical splitting has gone as far as it can, some large businesses still cannot handle the data volume, and the only option is horizontal sharding across databases and tables. Sharding is computed from a primary key: if the key meets a given condition, the row goes to a given server. But doing this separately in every business system is very costly, so there has to be a common layer in the middle that solves horizontal splitting of the database.
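
As a rough illustration of "compute the shard from the primary key", the sketch below hashes the key with CRC32 and takes it modulo the number of shards. The hash choice, the shard count, and the `ShardedTable` class are assumptions for the example; the report does not say which rule was actually used.

```python
import zlib


def shard_for(primary_key, shard_count):
    """Map a primary key to a shard index with a stable hash (CRC32)."""
    return zlib.crc32(str(primary_key).encode("utf-8")) % shard_count


class ShardedTable:
    """Looks like one table to the caller but stores rows in several shards."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]  # one dict per database

    def insert(self, primary_key, row):
        self.shards[shard_for(primary_key, len(self.shards))][primary_key] = row

    def select(self, primary_key):
        return self.shards[shard_for(primary_key, len(self.shards))].get(primary_key)


orders = ShardedTable(shard_count=4)
for order_id in range(1000):
    orders.insert(order_id, {"order_id": order_id})
```

The caller only uses `insert` and `select`; which physical shard holds a row is decided inside the routing function, which is the kind of common middle layer the paragraph argues for.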

So we developed a database middleware called TDDL. TDDL supports horizontal splitting at the middleware level: the business writes code as if it were using a single database and does not need to be aware of much, while the data is actually scattered across many databases. There was another system at the time, CORONA, which we have since put on the cloud. It follows the standard JDBC protocol: applications are still written against standard JDBC and need no awareness of anything else, and it ships as a standard JDBC package. Requests are sent to our server, the server does the sharding, and the application is completely unaware of it.

3.4.3 MQ cluster

For example, suppose I want to create an order, and the order depends on more than 200 systems. If they are called synchronously one after another, the final response time may be very long. What then? We would choose concurrency: A calls B, C, and D directly through HSF. But there is a problem. If one of the 200-plus downstream systems hangs, the whole request hangs; and if that system has a serious problem, my subsequent requests hang too, which eventually brings down my own system. At this point we need an asynchronous, decoupled approach, and that is where message-oriented middleware comes in. With message middleware, when A wants to make a call it sends a message; the downstream systems subscribe to that message, do their own processing, and return the result to the middleware when done. That completes the asynchronous communication. So if one of those systems has a problem, the front-end trading system that creates the order does not care: it just sends the message, gets a callback once everything has been handled, and does not care how the calls were made.
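
Here is a small in-memory sketch of that decoupling. It is not the API of any real message queue: `MessageBroker`, the topic name, and the two handlers are hypothetical, and delivery is a simple loop rather than a networked service.

```python
from collections import defaultdict, deque


class MessageBroker:
    """Toy in-memory broker: publishing only enqueues; delivery happens later."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> handler callables
        self.queue = deque()                  # pending (topic, message) pairs
        self.failed = []                      # (handler name, message) that raised

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        self.queue.append((topic, message))   # returns immediately to the producer

    def deliver_all(self):
        while self.queue:
            topic, message = self.queue.popleft()
            for handler in self.subscribers[topic]:
                try:
                    handler(message)
                except Exception:
                    # One failing consumer does not block the others.
                    self.failed.append((handler.__name__, message))


shipped = []


def logistics(message):
    shipped.append(message["order_id"])


def broken_points_system(message):
    raise RuntimeError("downstream system is down")


broker = MessageBroker()
broker.subscribe("order_created", broken_points_system)
broker.subscribe("order_created", logistics)
broker.publish("order_created", {"order_id": 1001})  # the producer is done here
broker.deliver_all()
```

The order producer returns as soon as the message is enqueued, and the broken subscriber neither blocks the healthy one nor propagates its failure back to the producer.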

3.4.4 Distributed tracing

As our distributed architecture evolved, both the architecture and the dependencies became extremely complex. We wondered whether online problems could be visualized, so that we could see what happened and what the calling relationships and links between systems were. That led to distributed tracing (EAGLEEYE). With EAGLEEYE we can see clearly how a request travels from the entry point to the end, what happens along the way, and which part may be at fault. In the original figure, the faulty spot is marked in red, so we can tell which system has the problem. We no longer need everyone to check their own systems as before, which used to take a long time.
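
The core of tracing is that every hop records a span under one shared trace id. The sketch below shows only that idea; it is not how EAGLEEYE is implemented, and the `Tracer` class, the span fields, and the three example systems are assumptions.

```python
import uuid


class Tracer:
    """Collects spans that share one trace id across a chain of calls."""

    def __init__(self):
        self.spans = []  # each span: trace id, system name, parent, status

    def call(self, trace_id, parent, system, work):
        span = {"trace_id": trace_id, "system": system, "parent": parent, "ok": True}
        self.spans.append(span)
        try:
            return work(trace_id)
        except Exception:
            span["ok"] = False   # the hop that a UI would mark in red
            raise

    def failed_systems(self, trace_id):
        return [span["system"] for span in self.spans
                if span["trace_id"] == trace_id and not span["ok"]]


tracer = Tracer()


def inventory(trace_id):
    raise RuntimeError("inventory timeout")


def trade(trace_id):
    try:
        tracer.call(trace_id, "trade", "inventory", inventory)
    except RuntimeError:
        return "order failed"    # trade degrades instead of crashing
    return "order created"


def gateway(trace_id):
    return tracer.call(trace_id, "gateway", "trade", trade)


trace_id = uuid.uuid4().hex
result = tracer.call(trace_id, None, "gateway", gateway)
path = [span["system"] for span in tracer.spans]
```

Filtering the collected spans by trace id gives both the path the request took and the one system that failed, without each team having to inspect its own logs.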

3.4.5 Other systems

Yugong (the "Foolish Old Man"): once we had split databases and tables, how did we migrate the Oracle data to MySQL? For this we developed several pieces of middleware. The first is Yugong, which moves the data in Oracle bit by bit into MySQL, into each of the databases, while ensuring that our business is not affected.

Jingwei (named after the mythical bird that tried to fill the sea with pebbles): once the database is split into databases and tables, we also need a trigger between the database and the cache. When data changes, an event has to be fired. We could write a program for each case, but we have since consolidated this into a middleware system, Jingwei. It listens for changes in every database; whenever a record changes, it fires an event. The business side can write a Jingwei worker for its own needs and deploy it into Jingwei, and the worker runs the corresponding logic when the event arrives. The most typical use is triggering cache logic. Under our multi-IDC architecture, when data is written at a single point, how do other regions notice the change? Through a system like Jingwei. When the database changes, Jingwei triggers cache invalidation; when the next business request comes in, the cache is refilled with the latest data and the business sees the latest data. This avoids the situation where unit A's data has changed but the caches in units B and C have not been updated.
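
The change-event-to-cache-invalidation loop can be sketched like this. It is a simplified model, not Jingwei itself: the `ChangeStream` class fires events directly on write rather than reading a database log, and all names are hypothetical.

```python
class ChangeStream:
    """Toy change feed: every row change is published to registered workers."""

    def __init__(self):
        self.rows = {}
        self.workers = []   # callables invoked as worker(key, new_value)

    def add_worker(self, worker):
        self.workers.append(worker)

    def write(self, key, value):
        self.rows[key] = value
        for worker in self.workers:   # every change triggers an event
            worker(key, value)


class UnitCache:
    """Per-unit cache that is refilled from the database on a miss."""

    def __init__(self, database):
        self.database = database
        self.entries = {}

    def invalidate(self, key, _new_value):
        self.entries.pop(key, None)

    def read(self, key):
        if key not in self.entries:
            self.entries[key] = self.database.rows[key]
        return self.entries[key]


database = ChangeStream()
unit_a, unit_b = UnitCache(database), UnitCache(database)
database.add_worker(unit_a.invalidate)
database.add_worker(unit_b.invalidate)

database.write("item:1:price", 100)
before = (unit_a.read("item:1:price"), unit_b.read("item:1:price"))
database.write("item:1:price", 80)   # both unit caches are invalidated
after = (unit_a.read("item:1:price"), unit_b.read("item:1:price"))
```

Each unit's worker only drops the stale entry; the fresh value is loaded lazily on the next read, matching the "invalidate, then refill on request" behaviour described above.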

Chapter 4: Taobao 4.0

4.1 Problems with the 3.0 architecture

The most important problem is resource constraints. When all our machine rooms are in one place, that place cannot expand indefinitely; as we add more and more servers, eventually they will not fit. In 2013, for example, after we bought machines there was nowhere left to put them in the Hangzhou machine room. With our Double 11 events, sales and per-second peaks keep climbing and our costs rise with them, so we would eventually hit the resource limit of a single region. The second problem is scalability: some businesses cannot be deployed in just one place, because access is slow for users elsewhere, and some need to be deployed abroad, so the business has a need for remote deployment. The third is disaster recovery; natural and man-made disasters cannot be ruled out. Also in 2013, Hangzhou reached 40°C and our machine room came close to having its power rationed; fortunately that did not happen in the end. But it was a warning that we had to evolve our architecture. If we did not, one day our resources would run out.

4.2 The 4.0 architecture

The idea behind this evolution is not to put all the eggs in one basket. We divide the business into units that can be placed anywhere, so that the whole system can be spread across the world with no strong dependence between units. When something goes wrong in one region it does not affect the others; we just switch the traffic. This is the architecture we use now.

Along the business dimension, the business is divided into logical units. The first thing we unitized was trading: we divided up the whole transaction link and put it into each logical unit, split it horizontally, and then partitioned the data horizontally as well. Data should not cross units. Crossing units causes problems. For disaster recovery, if unit A goes down it may affect not only A but other units too. Cross-unit calls also have higher latency, which makes placing an order a poor experience for end users. So we follow the unit-closure principle: the calling relationships within each unit stay closed inside that unit and do not cross units.

On the technical side, we layered the architecture and set several principles. Besides unit closure there is global routing, which handles the allocation of user traffic across all units: when user traffic comes in, it is assigned to the corresponding unit according to our routing rules. Incoming traffic first reaches the CDN, which knows which unit the user belongs to. When the request arrives at a unit, the access layer checks whether it belongs to that unit; if not, it is redirected to the right unit. If it does belong, it continues down to the database layer. If something is wrong at the database layer, our approach is to let the write fail rather than write inconsistent data, so the data stays consistent. This raises new data consistency questions. For example, orders are written into units by buyer, but how does the seller ship the goods? Seller data is actually kept in the center, so all the buyers' order data has to be synchronized to the seller. Our model here is eventual consistency, for data that is not very time-sensitive and can be synchronized slowly. For instance, if a buyer places an order and it takes a second or two to sync to the seller, everyone can accept that. For strongly consistent data we go cross-unit and read the data from a single place. This is the part marked in red in the original figure, the strong center dependency, and it is the core of our trading link: the cross-unit dependency.
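
The routing and access-layer check can be sketched as below. The report only says traffic is assigned "according to our routing rules", so routing by user id modulo the number of units is an assumption, as are the unit names and the fallback rule used when a unit is switched out.

```python
UNITS = ["hangzhou", "beijing", "shenzhen"]  # hypothetical unit names


def route(user_id, units, offline=frozenset()):
    """Pick the home unit for a user; fall back if that unit is switched out."""
    home = units[user_id % len(units)]
    if home not in offline:
        return home
    available = [unit for unit in units if unit not in offline]
    return available[user_id % len(available)]


def access_layer(unit, user_id, units):
    """Serve the request locally if it belongs here, otherwise redirect it."""
    target = route(user_id, units)
    if target == unit:
        return f"served in {unit}"
    return f"redirect to {target}"


local = access_layer("beijing", 4, UNITS)
misrouted = access_layer("hangzhou", 4, UNITS)
failover = route(4, UNITS, offline={"beijing"})
```

Users whose home unit is healthy keep the same unit during a failover; only the traffic of the unit that was switched out is moved elsewhere.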

Let me sum this up as I understand it, using the shopping flow as an example; it also matches the original figure:

First, the system is split horizontally into units. Each unit contains all the product data plus the buyers' personal data, so browsing the catalog, viewing details, placing orders, and so on can all be confined to that unit. If we then deploy a unit in Beijing, users in Beijing do not have to reach servers in Hangzhou; everything can be handled by accessing the Beijing servers directly.

That is the basic logic, but there are several problems. The first: after one user places an order, the data is in the Beijing unit; another user places an order and the data is in Hangzhou. How does the merchant know how many people have bought? Usually a central unit is set up, as in the original figure. The central unit stores all buyers' order data as well as the full seller data and full product data, and the other units update the central unit's data. When a buyer places an order in Beijing, the central unit is updated in addition to the Beijing unit, so the seller can see the buyers' orders from the central unit and ship accordingly. It is not clear whether the update to the central unit is synchronous. It does not seem to be: syncing to the seller after one or two seconds suggests the data is sent to the central unit asynchronously, for example through MQ. But the asynchronous sync must be guaranteed to succeed; otherwise, if the Beijing unit goes down, Beijing users will wonder why the order they just placed has disappeared.

There is also some key data, such as user wallet balances and inventory, that every unit needs and that requires strong consistency. It is kept entirely in the central unit; each unit reads it from the central unit rather than storing it privately. This corresponds to the strong center dependency in the original figure.

This is our whole remote multi-active architecture. Remote multi-active solves the disaster recovery problem, as well as the resource and business scalability problems. Whenever we find we are short of resources, we just create a new unit to expand capacity. Resources are scheduled in a unified way, and the site-building platform uses that scheduling to stand up a unit quickly. Once the unit passes full-link stress testing it can go into service, which solves the capacity problem. Disaster recovery is handled through global routing: when a unit has a problem, we just quickly switch its traffic away. Business scalability is a natural property of the architecture; we have deployed units thousands of miles away.

After this the report covers full-link stress testing and high availability. There is not much there and I am not familiar with it, so I did not write it up. If you are interested, you can open the links at the beginning of this post.

Chapter 5: A new starting point

Why is this called a new starting point rather than Taobao 5.0? I think it is because "young people don't talk about 5" (a pun on a 2020 Chinese internet meme about young people having no martial ethics). Alibaba's team in 2017 managed to predict what would happen in 2020; I really admire that (doge).

The cloud will become a basic resource like water, electricity, and coal. More and more businesses will run on the cloud, and many of them will go through the same growth Taobao did. We will accelerate the development of these businesses and create more value: drive business with technology and export our technical capabilities to the cloud.

Today we serve not only our own Double 11; we also want to provide technical services to other enterprises, driving them with technology so that they can focus on the business without worrying much about how the underlying layers are implemented. Some of these cloud products, such as DRDS, EDAS, and MQ, can be seen directly on Alibaba Cloud.

Today, Alibaba's architecture offering is called Aliware. Aliware provides enterprise-grade Internet architecture on the cloud and supports a great many enterprises.

Analysis:

This refers to the cloud. Much of the middleware mentioned above, such as MQ, RPC, distributed database middleware, and file servers, is used in every large project. Building a cluster for each system is troublesome, so they can be extracted into a shared platform, the so-called cloud. That is not only more professional; it can also be offered to all kinds of projects, even other companies' projects, so developers can focus on implementing the business. It is said that much of Taobao is already on the cloud, and apparently even the more important payment system is too. I am not sure about this; if you are interested, look into it yourself.

I like this a lot. I once built my own RocketMQ cluster, and now I feel I lost out badly; it was a bit of a waste of time. It would have been better to just buy one on Alibaba Cloud: convenient, and it does not cost much. That's all from me.

Copyright notice
This article was written by [Programmer base camp]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202201233270791.html