Research and development practice of Kwai real-time data warehouse support system
2022-06-26 12:40:00 【Flink_China】
Abstract: This article is compiled from a talk on real-time data warehouses given at Flink Forward Asia 2021 by Li Tianshuo, a technical expert on the Kwai real-time computing data team. The main contents include:
- Business characteristics and pain points of real-time data warehouse guarantee
- Architecture of the Kwai real-time data warehouse guarantee system
- Real-time guarantee practice for Spring Festival activities
- Future planning
Click to view live playback & speech PDF
1. Business characteristics and pain points of real-time data warehouse guarantee
- The biggest business characteristic of Kwai is the huge data volume: daily ingress traffic is at the trillion-record level. For such a large ingress, reasonable model design is needed to avoid the waste caused by repeated reads and consumption. In addition, when reading and standardizing the data sources, performance has to be squeezed to the extreme to keep the processing of the ingress traffic stable.
- The second characteristic is the diversity of requirements. Kwai's business scenarios include large screens for major activities, 2B and 2C business applications, internal core dashboards, and real-time search support. Different scenarios have different guarantee requirements. Without link classification, high- and low-priority applications get mixed together, which has a great impact on link stability. In addition, because the core of Kwai's business is content and creator IP, we need to build common dimensions and common models, to prevent repeated chimney-style construction and to quickly support application scenarios through the common models.
- The third characteristic is frequent activities with high demands on the activities themselves. The core demands fall into three aspects: reflecting how much the activity pulls the company's overall indicators; analyzing real-time participation after the activity starts so that the play strategy can be adjusted; and quickly perceiving the effect of the activity, for example through real-time monitoring of red-packet costs. An activity usually involves hundreds of indicators but only 2-3 weeks of development time, which puts high demands on stability.
- The last characteristic is Kwai's core scenarios: one is the core real-time indicators provided to executives, the other is real-time data applications for C-end users, such as the Kwai Shop and the Creator Center. These require extremely high data accuracy, and problems must be perceived and handled immediately.
The factors above make the construction and guarantee of Kwai's real-time data warehouse necessary.
In the initial stage of real-time data warehouse guarantee, we borrowed the guarantee processes and specifications from the offline side and divided the work into three stages by life cycle: the development stage, the production stage, and the service stage.
- In the development stage, we built model design specifications, model development specifications, and a release checklist.
- In the production stage, we mainly built the underlying monitoring capabilities, covering timeliness, stability, and accuracy, and carried out SLA optimization and governance improvements based on these monitoring capabilities.
- In the service stage, we defined the service standards and guarantee levels for upstream consumers, as well as the value of the overall service.
But compared with offline, the learning cost of real-time is high, and even after the above construction, several problems remained in each stage:
- Development stage: Flink SQL has a steeper learning curve than Hive SQL, so it is easy to introduce hidden risks during development. In addition, in real-time computing scenarios, whether a job can keep up with consumption at the peak of an activity is an unknown. Finally, repeated consumption of the DWD layer is a great challenge to real-time resources, so resource issues must be considered when choosing data sources and dependencies.
- Production stage: without a cleanup mechanism, state keeps growing and jobs fail frequently. In addition, high-priority and low-priority jobs need data-center isolation, so deployment has to be planned before going online and adjusted afterwards, at a cost much higher than offline.
- Service stage: for a real-time task, the least acceptable outcome is a job failure or restart that causes duplicate data or a dip in the output curve. To avoid such problems a standardized scheme is needed; offline jobs can, with high probability, guarantee data consistency after a restart, while real-time jobs need extra care to achieve the same.
Viewed more abstractly, real-time data warehouse guarantee is still harder than offline in several respects, specifically:
- High timeliness. Compared with offline execution times, a minute-level delay in real time already requires operations intervention, so the timeliness requirements are high.
- Complexity. It shows in two aspects: on one hand, the data is not persisted, so verifying the data logic is more difficult; on the other hand, real-time jobs are mostly stateful, and when a service fails the state may not be fully preserved, leading to many bugs that cannot be reproduced.
- Large data volume. The overall QPS is relatively high, with ingress traffic at the hundred-million level.
- Randomness of problems. Problems in the real-time data warehouse occur at more random times, with no pattern to follow.
- Mixed development capabilities. We need to ensure that common scenarios use a unified development scheme, to prevent uncontrollable problems caused by divergent development approaches.
2. Architecture of the Kwai real-time data warehouse guarantee system
Based on the difficulties above, we designed two complementary approaches:
- One is a forward guarantee approach based on the development life cycle, which ensures that every stage of the life cycle has specifications and scheme guidance and standardizes 80% of the general requirements.
- The other is a reverse guarantee approach based on fault injection and scenario simulation, which uses scenario simulation and fault injection to verify that the guarantee measures are truly in place and meet expectations.
2.1 Forward guarantee
The overall idea of the forward guarantee is as follows:
- The development stage mainly covers requirement research and standardizes how the base layer and the application layer are developed. This solves 80% of the general requirements; the remaining 20% of personalized requirements are handled through scheme review, and standardized solutions are continuously distilled from these personalized requirements.
- The testing stage mainly performs quality verification, comparison against the offline side, and stress-test resource estimation. The self-test phase ensures overall accuracy mainly through offline-real-time consistency comparison and through comparing server-side dashboards with the real-time results.
- The go-live stage mainly prepares plans for the launch of important tasks, confirming the actions before going live, the deployment method during launch, and the patrol-inspection mechanism after launch.
- The service stage mainly focuses on target-oriented monitoring and alarming, to make sure the service stays within the SLA.
- Finally, the decommissioning stage mainly handles resource recycling and restoring the deployment.
The Kwai real-time data warehouse is divided into three layers:
- First, the DWD layer. Its logic is relatively stable and rarely personalized; it processes three formats of data: client logs, server-side logs, and Binlog data. The logical processing consists of three operations:
- The first operation is scenario splitting. Because the real-time data warehouse has no partition-table logic, the purpose of splitting by scenario is to generate sub-topics and prevent repeated consumption of the full topic's data.
- The second operation is field standardization, which includes standardizing dimension fields, filtering dirty data, and mapping IP addresses to longitude and latitude.
- The third operation is handling the dimension associations in the logic. The association of common dimensions should be completed in the DWD layer as much as possible, to prevent excessive downstream dependence from putting too much pressure on the dimension tables; the dimension tables are usually served through KV storage plus a second-level cache (a lookup-join sketch follows after this list).
- Second, the DWS layer. It has two different processing modes: one is a DWS layer aggregated by dimension in minute-level windows, which provides an aggregation layer for reusable downstream scenarios; the other is DWS data at single-entity granularity, for example raw logs aggregated to core user or device granularity, which greatly reduces the join pressure caused by the large DWD data volume and can be reused more effectively. The DWS layer also needs dimension expansion: because the DWD data volume is too large, it cannot fully cover all dimension-association scenarios, so associations whose QPS requirement is too high, or that would introduce a certain delay, need to be completed at the DWS layer.
- Third, the ADS layer. Its core is to rely on DWD- and DWS-layer data to perform multi-dimensional aggregation and produce the final output.
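As an illustration of the dimension association described above, the following is a minimal Flink SQL sketch of a lookup join against a KV-backed dimension table. The table and column names (dwd_user_action, dim_user, proc_time, and so on) are hypothetical, and the actual connectors and cache configuration used at Kwai are not shown; this only demonstrates the standard FOR SYSTEM_TIME AS OF lookup-join pattern.

```sql
-- Hypothetical tables: dwd_user_action is the fact stream (with a processing-time
-- attribute proc_time); dim_user is a dimension table backed by KV storage whose
-- connector is assumed to support lookup caching.
SELECT
  a.device_id,
  a.event_time,
  u.city,            -- common dimension filled in at the DWD layer
  u.user_level
FROM dwd_user_action AS a
LEFT JOIN dim_user FOR SYSTEM_TIME AS OF a.proc_time AS u
  ON a.user_id = u.user_id;
```

Completing such joins once at the DWD layer means downstream consumers reuse the widened records instead of each hitting the dimension store separately.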
Based on the design ideas above, it is not hard to see that the flow splitting, field cleaning and standardization, and dimension association of the DWD and DWS layers target different data formats but share the same logic. This basic logic can be developed into a templated SDK, so that the same logic everywhere uses the same SDK API. This has two advantages: duplicated logic no longer needs to be copy-pasted, and optimization experience and lessons learned are deposited into the templates.
For the ADS-layer data, we have distilled many solutions from business requirements, for example how to compute multi-dimensional PV/UV, how to compute rankings, how to express indicator-card SQL, and how to produce results in scenarios where distribution-type metrics need a fallback.
SQL is very convenient and efficient and can greatly shorten development time, but its execution efficiency has certain disadvantages compared with the API. So for the base layer and large-traffic DWS scenarios we still develop with the API, while the application layer is developed with SQL.
For most Kwai activities, the indicators the business cares about most are the number of participants in certain dimensions and the cumulative curve of money received, and it wants a curve computed every minute from 0:00 to the current time. This kind of indicator development covers roughly 60% of the activity-side requirements. What are the difficulties in developing it?
Using a regular tumbling window plus custom state for deduplication has a drawback: if data arrives out of order across windows, data is lost badly and accuracy suffers. More accurate data means tolerating higher latency; lower latency means possibly inaccurate data. In addition, in abnormal situations the data may need to be backfilled from some point in time, and in the backfill scenario raising the throughput causes intermediate results to be lost because of the advancing maximum timestamp.
To solve this, Kwai developed a progressive-window solution. It has two parameters: a day-level window and a minute-level step for output. The overall computation has two parts: first a day-level window is generated, the data source is read and split into buckets by key, so that data with the same key lands in the same bucket; then the watermark advances by event time, and whenever it passes a window step boundary the window computation is triggered.
As shown in the figure above, data with key=1 is routed to the same task. After the task's watermark advances past the step boundary, the small windows output their merged bitmap and PV results and send them downstream, where the data falls into the corresponding window by server time and is triggered by the watermark mechanism. When the global window merges the buckets, it accumulates and deduplicates the per-bucket results and outputs the final value. In this way, out-of-order and late data is not discarded but recorded at the time node after the delay, which better guarantees accuracy; the overall data difference dropped from 1% to 0.5%.
On the other hand, a window computation is triggered whenever the watermark passes a window step, so the curve delay can be kept within one minute, which better guarantees timeliness. Finally, controlling window output by the watermark step ensures that every point of the step window is emitted, so the output curve stays as smooth as possible.
The figure above shows a concrete SQL case: internally the data is split into buckets by deviceId, and a cumulate window is built on top. The window has two parts: a parameter for the day-level accumulation and a watermark parameter that divides the step windows; the outer layer aggregates the indicators produced by the different buckets.
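This pattern can be sketched with the open-source Flink CUMULATE windowing TVF (available since Flink 1.13) combined with a manual split-distinct: the inner query buckets by a hash of deviceId, the outer query merges the buckets. Table and column names (user_events, device_id, event_time) are hypothetical, device_id is assumed to be a STRING, event_time is assumed to be declared as an event-time attribute with a watermark, and Kwai's internal progressive-window implementation may differ in its details.

```sql
-- Per-minute cumulative UV over a day-level window.
-- Inner query: cumulate window plus bucketing by hash(device_id) for partial distinct counts.
-- Outer query: merge the per-bucket partial results into the final curve.
SELECT
  window_start,
  window_end,
  SUM(bucket_uv) AS uv
FROM (
  SELECT
    window_start,
    window_end,
    MOD(HASH_CODE(device_id), 1024) AS bucket_id,
    COUNT(DISTINCT device_id)       AS bucket_uv
  FROM TABLE(
    CUMULATE(TABLE user_events, DESCRIPTOR(event_time),
             INTERVAL '1' MINUTE,    -- output step
             INTERVAL '1' DAY))      -- day-level accumulation window
  GROUP BY window_start, window_end, MOD(HASH_CODE(device_id), 1024)
)
GROUP BY window_start, window_end;
```

The bucketing keeps the heavy COUNT(DISTINCT) state spread across parallel tasks, while the outer SUM stays cheap, mirroring the bucket-then-merge structure described above.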
In the go-live phase, the first thing is to have a solid timeline-based guarantee specification, covering time, operator, plan content, operation records, and checkpoints.
- Before the activity: deploy the tasks and make sure there are no computation hotspots, check whether the parameters are reasonable, and observe the job and cluster status;
- During the activity: check whether the indicators are produced normally, patrol the task status, and troubleshoot and switch links when problems occur;
- After the activity: take the activity tasks offline, recycle the activity resources, and restore the link deployment.
The link here starts from the Kafka data source and feeds the ODS, DWD, and DWS layers; data for C-end users is written to KV storage, analysis scenarios are written to ClickHouse, and finally data services are produced on top. We divide the tasks into 4 levels, P0 to P3.
- P0 tasks are the activity large screen and C-end applications. Their SLA requires second-level delay and an error within 0.5%, but the overall guarantee window is relatively short: an activity cycle generally lasts about 20 days, and New Year's Eve activities finish within 1-2 days. Our answer to the delay requirement is multi-data-center disaster recovery for Kafka and the OLAP engine, plus hot-standby dual-data-center deployment for Flink.
- For P1-level tasks, we deploy Kafka and the OLAP engine across two data centers. On one hand, the dual-data-center deployment provides disaster recovery and an escape path; on the other hand, the online data centers have better hardware, so a machine failure rarely causes a job restart.
- P2- and P3-level tasks are deployed in the offline data center. If there is a resource gap, P3 tasks are stopped first to free resources for other tasks.
The service phase is mainly divided into 4 monitoring levels:
- First, SLA monitoring, which mainly monitors the quality, timeliness, and stability of the final output indicators.
- Second, link task monitoring, which mainly monitors the task status, data sources, processing procedure, and output results, as well as the IO, CPU, and network information of the underlying tasks.
- Third, service monitoring, which mainly covers service availability and latency.
- Finally, underlying cluster monitoring, including the CPU, IO, memory, and network information of the underlying clusters.
The accuracy targets consist of three parts: offline-real-time indicator consistency ensures that the overall data processing logic is correct; consistency between the OLAP engine and the application interface ensures that the service's processing logic is correct; and indicator-logic error alarms ensure that the business logic is correct (a comparison-query sketch follows after the list below).
- Accuracy alarms cover 4 aspects: accuracy, volatility, consistency, and integrity. Accuracy includes comparisons between the active and standby links and whether dimension drill-downs are accurate; volatility measures whether fluctuations stay within a sustainable range, to prevent anomalies caused by large swings; consistency and integrity are ensured through enumeration and indicator measurement, so that the output is consistent and never incomplete.
- The timeliness targets are also 3: interface delay alarms, OLAP engine alarms, and Kafka delay alarms on the interface tables. Broken down to the link level, we again analyze the input, processing, and output of the Flink tasks: for the input, the core concerns are delay and out-of-order data, to prevent data from being discarded; for processing, the core concerns are data volume and processing performance indicators; for the output, the concerns are the output data volume and whether rate limiting is triggered.
- The stability targets are 2: one is the stability of the services and the OLAP engine and the batch/stream delay, the other is the recovery speed of the Flink jobs. Whether a Flink job can recover quickly after a failover is also a great test of link stability. For stability we mainly watch the load of job execution, the status of the services the job depends on, the load of the whole cluster, and the load of individual tasks. We alarm on the targets and monitor the sub-targets decomposed from them, building an overall monitoring and alarm system.
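As an illustration of the offline-real-time consistency check mentioned above, here is a minimal sketch of a comparison query. The table and column names (dws_uv_hourly_offline, dws_uv_hourly_rt, dt, hr, uv) are hypothetical; the 0.5% threshold matches the accuracy target described in this article, but the actual check at Kwai runs on its internal quality platform.

```sql
-- Flag the hours where the real-time UV deviates from the offline UV by more than 0.5%.
SELECT
  o.dt,
  o.hr,
  o.uv AS offline_uv,
  r.uv AS realtime_uv,
  ABS(o.uv - r.uv) / CAST(o.uv AS DOUBLE) AS rel_diff
FROM dws_uv_hourly_offline AS o
JOIN dws_uv_hourly_rt      AS r
  ON o.dt = r.dt AND o.hr = r.hr
WHERE ABS(o.uv - r.uv) / CAST(o.uv AS DOUBLE) > 0.005;
```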
2.2 Reverse guarantee
Before an activity actually goes live it is difficult to simulate the real online environment and the real traffic peak, so the key questions of the reverse guarantee are: can the link withstand the expected activity traffic peak, and how should failures be handled?
The core idea is to simulate the real activity traffic peak through stress-test drills. First, single-job stress tests determine the resource allocation of each job and the layout of the cluster it runs in, and a full-link stress test ensures that cluster resources are used at a reasonable level, neither too high nor too low, and that consumption at the peak stays stable. Second, we build disaster recovery, proposing guarantee measures mainly for job failures, consumption delays, and data center failures. Then, through drills, we verify that these measures can actually be used and achieve the expected results. Finally, we review and improve the plans and link risks according to the expectations and goals of the drills.
We built our own stress-test link; in the figure, the upper part is the normal link and the lower part is the stress-test link. We first read the online topic as the initial data source of the stress-test link and use a rate-limit algorithm for flow control. For example, with 4 tasks and a target of 10,000 QPS, the QPS generated by each task is limited to 2,500. While generating the data, the user is rewritten with users from the crowd package (a prepared user cohort) and the timestamp is rewritten, to simulate the actual users of the activity day.
The job reads the stress-test topic and produces a new topic after processing. How do we judge whether the stress test has really passed? There are three criteria:
- First, the job's input read delay stays at the millisecond level, and the job itself has no backpressure.
- Second, CPU utilization does not exceed 60% of the overall resources, so the cluster keeps spare buffer.
- Third, the computed results are consistent with the crowd package, proving that the logic is correct.
The single-job stress test yields a lot of information to guide the follow-up work. For example, it proves that the activity can meet the SLA under the expected traffic, it exposes job performance bottlenecks to guide optimization towards the corresponding standards, and it produces scenario benchmarks that make it easier to plan resources for low-priority jobs.
After the single-job stress tests, we still cannot judge what happens when all jobs run at full load, for example the CPU, IO, and memory pressure on the whole Flink data center. So we start every job at its stress-test target value and observe the overall behavior of the jobs and of the cluster.
So how do we judge whether the full-link stress test passes? There are also three criteria:
- First, the jobs' input read delay stays at the millisecond level and there is no backpressure.
- Second, the overall CPU utilization does not exceed 60%.
- Third, the final computed results are consistent with the crowd package.
After the full-link stress test, we can prove that the activity can meet the SLA at the expected traffic peak, fix the resource layout of the jobs under the target QPS, determine in advance the resources and deployment parameters each job needs, and establish the maximum traffic upstream of each data source, which provides the basis for the subsequent rate-limiting guarantee.
There are two kinds of fault drills:
- One is single-job fault drills, covering Kafka topic failures, Flink job failures, and Flink job checkpoint (CP) failures.
- The other is more systemic faults such as link failures: how to guarantee normal output when a single data center fails, how to avoid an avalanche effect when the activity traffic is much higher than expected, and if a job lags by more than an hour, how long it takes to recover.
Disaster recovery construction is divided into two parts: link fault-tolerance guarantee and link capacity guarantee.
The core of the link fault-tolerance guarantee is to solve the long recovery time of single-data-center and single-job failures and the problem of service stability. Kafka supports dual-data-center disaster recovery: the generated traffic is written to the Kafka clusters in both data centers, and when one data center fails the traffic is automatically switched to the other one, with the Flink jobs unaware of the switch. Conversely, after the failed data center recovers, the state of its Kafka cluster is detected automatically and traffic is added back.
Likewise, the disaster recovery strategy applies to the OLAP engine. For the Flink tasks, we deploy hot-standby dual links with identical logic on the active and standby sides; when one data center fails, the application side simply switches to the OLAP engine of the other link, so the application side does not perceive the fault.
The link capacity guarantee solves two problems: if the activity traffic is much higher than expected, how do we keep the link stable? And if lag does occur, how do we estimate how long it takes to catch up on the consumption delay?
From the results of the earlier full-link stress test we obtain the maximum ingress traffic of each task and use it as the maximum rate limit of the job. When the activity traffic exceeds expectations, the data-source side triggers read rate limiting and the Flink job runs at the maximum load established by the stress test. The job's consumption then lags, but the other jobs in the link are protected and keep running normally. After the traffic peak, the time the job needs to return to normal can be calculated from the lag and the ingress traffic. This is the core measure of the link fault-tolerance and capacity guarantee.
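A rough way to estimate that catch-up time, with purely hypothetical numbers (the real figures depend on each job's stress-test results):

catch-up time ≈ lag / (maximum consumption rate - current input rate)

For example, if a job has accumulated a lag of 360 million records, its stress-tested maximum consumption rate is 1.5 million records per second, and the post-peak input rate is 1.0 million records per second, the catch-up time is roughly 360,000,000 / (1,500,000 - 1,000,000) = 720 seconds, i.e. about 12 minutes.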
3. Real-time guarantee practice for Spring Festival activities
The Spring Festival activities had the following requirements:
- High stability: with massive data, the link must remain stable overall, or recover quickly when failures occur.
- High timeliness: with hundred-million-level traffic, the large-screen indicator cards require second-level delay and the curves require 1-minute-level delay.
- High accuracy: on complex links, the difference between offline and real-time indicators must not exceed 0.5%.
- High flexibility: multi-dimensional analysis application scenarios must be supported during the activity.
The overall plan for the Spring Festival activities was divided into forward and reverse guarantee measures.
The basis of the forward guarantee measures is the monitoring and alarm system, which has two parts. One is the SLA target alarms built around timeliness, accuracy, and stability. The other is the link-oriented monitoring system, including link monitoring, availability monitoring of the services the link depends on, and cluster resource monitoring.
On top of the monitoring system, the forward guarantee measures mainly standardize the development, testing, and go-live stages. In the development stage, 80% of the requirements are handled with standardized templates, and the remaining 20% go through scheme review to resolve the risky parts. In the testing stage, logic accuracy is ensured through comparison; in the go-live stage, deployment is done in stages and the tasks are patrolled.
The reverse guarantee measures require two basic capabilities. The first is stress testing: single-job stress tests pin down task performance bottlenecks and guide optimization, while the full-link stress test determines whether the jobs can carry the traffic peak and provides the data basis for disaster recovery. The disaster recovery capability relies mainly on multi-data-center deployment, rate limiting, retries, and degradation, so that there is a corresponding plan when failures occur.
Finally, through fault drills we inject faults into each component on one hand and simulate peak traffic on the other, making sure the stress-test and disaster-recovery capabilities are truly in place.
In the online stage, the operation steps before, during, and after the activity are executed and recorded against the timeline plan; after the activity the project is reviewed, and the problems found are fed back into both the forward and the reverse capability building of the guarantee system.
The Spring Festival activity practice was a great success. In terms of timeliness: facing a traffic peak of hundreds of millions, the core large-screen indicator cards had second-level delay and the curve-type indicators had 1-minute-level delay; single tasks processed data at the trillion-record level and still kept second-level delay during the traffic peak. In terms of accuracy: the offline and real-time results of the core links differed by less than 0.5%, there were no data quality incidents during the activity, and the effective use of the Flink SQL progressive window greatly reduced the accuracy loss caused by dropped window data, bringing the data difference down from 1% to 0.5%. In terms of stability: the core links rely on dual-data-center disaster recovery and hot-standby dual-link deployment of the Flink clusters, allowing second-level switching when problems occur; the accumulated stress-test and disaster-recovery capabilities lay the foundation for building the future activity guarantee system.
4. Future planning
Based on the existing methodology and application scenarios, we have also laid out future plans.
- First, guarantee capability building. Turn stress testing and fault injection into standardized, scripted plans, automate their execution through platform capabilities, and diagnose problems intelligently after a stress test by distilling past expert experience.
- Second, stream-batch unification. In past activity scenarios, batch and stream were two completely separate systems. We have practiced stream-batch unification in some scenarios and are promoting the overall platform construction, improving development efficiency through unified SQL and reducing machine pressure by staggering peaks.
- Third, real-time data warehouse construction. By enriching the content layers of the real-time data warehouse, accumulating solution templates, and moving towards SQL-based development, we aim to improve development efficiency and ultimately achieve cost reduction and efficiency improvement.