Exploring Stream-Batch Unified Architecture and Its Practice in the Kangaroo Cloud Data Stack
2022-06-23 02:57:00 【Data stack dtinsight】
I. About the Stream-Batch Unified Data Warehouse
Stream-batch unification is an architectural idea: for the same business, the same SQL logic serves the computation of both stream-processing and batch-processing tasks.
In terms of timeliness, batch processing can only present business data at t+1, while stream processing presents it at t+0. When the two are kept separate, an enterprise has to run two sets of code, with high development, operations, and labor costs and a long delivery cycle. Stream-batch unification uses one set of code to produce both views of the business data, roughly halving development and maintenance costs while significantly improving timeliness.
So what is a stream-batch unified data warehouse? Simply put, it uses a single compute engine, together with the storage architecture prescribed by data-warehouse theory, to support both real-time processing of data from heterogeneous sources and offline analysis of business data sets.
Such a data set has the following characteristics:
Subject-oriented: the warehouse organizes data by subject domain;
Integrated: inconsistencies in the source data are eliminated, guaranteeing a consistent enterprise-wide view;
Relatively stable (non-volatile): data in the collection is retained long term and only loaded or refreshed periodically;
Trend-revealing (time-variant): historical information is preserved, enabling quantitative analysis and forecasting of the enterprise's development.
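As a minimal illustration of the "one logic, two execution modes" idea (a Python sketch only, not DTinsight code; all names are hypothetical), the same fold over order records can run as a bounded batch job or emit incremental results like a streaming job, converging on identical output:

```python
from typing import Dict, Iterable, Tuple

# One piece of business logic -- "count orders per user" -- expressed once.
# In a stream-batch warehouse the same SQL would be planned as either a
# bounded (batch) or unbounded (stream) job; here a fold plays that role.
def apply_order(state: Dict[str, int], order: Tuple[str, float]) -> Dict[str, int]:
    user, _amount = order
    state[user] = state.get(user, 0) + 1
    return state

def run_batch(orders: Iterable[Tuple[str, float]]) -> Dict[str, int]:
    """t+1 style: the whole bounded data set is folded in one run."""
    state: Dict[str, int] = {}
    for order in orders:
        apply_order(state, order)
    return state

def run_stream(orders: Iterable[Tuple[str, float]]):
    """t+0 style: the same logic emits an updated result per record."""
    state: Dict[str, int] = {}
    for order in orders:
        yield dict(apply_order(state, order))

orders = [("alice", 10.0), ("bob", 5.5), ("alice", 7.2)]
batch_result = run_batch(orders)
stream_results = list(run_stream(orders))
# The final streamed state matches the batch result exactly.
assert stream_results[-1] == batch_result
```

The point is that only `apply_order` encodes the business; batch and stream differ solely in how the runner drives it, which is what lets one code base serve both t+1 and t+0 presentation.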
II. The Evolution of the Data Stack's Stream-Batch Unified Warehouse
As customer volume and customer demands grew, and facing PB-scale batch and stream processing requirements, the Data Stack technology team met ever greater challenges. In the process, the Data Stack's warehouse architecture was gradually refined: from batch processing on a traditional architecture in 2017, through four years of iteration, to today's stream-batch unified warehouse on a hybrid architecture, as pictured:
Schematic of the evolution of the Data Stack's hybrid stream-batch unified warehouse architecture
1. Batch processing based on traditional architecture
In the early Internet era, although data volume grew dramatically, with tens of millions of fact records in a single day, customer scenarios were mostly of the "t+1" kind: analyzing data for the current day, week, or month. Offline analysis alone could satisfy these demands.
This coincided with the emergence of the Hadoop ecosystem. Facing the dual pressure of exploding data and limited storage, the Data Stack technology team adopted the Hadoop stack: data was periodically imported into HDFS, and Hive on the Hadoop platform served as the warehouse, enabling offline analysis of the massive data sets on HDFS.
This stage did not fundamentally change the earlier architecture: data was still loaded and analyzed periodically; only the tools changed, from classic data-warehouse software to big-data software.
2. Independent stream and batch processing on the Lambda architecture
With the development of the Internet and communication technology, day-old data could no longer meet customer needs. Customers expected more real-time data, so that whether in finance, stock exchanges, retail, or ports, real-time monitoring and alerting would let decision makers make favorable judgments at the first moment, improving efficiency and reducing losses.
In response to this change, the Data Stack technology team combined the mainstream big-data processing technologies of the time. On top of the original Hive warehouse, Spark, then the most advanced compute engine with stream-batch ambitions, was added to accelerate offline computation. At the same time, a stream-processing link based on Kafka storage and the Flink compute engine was added to the original offline big-data architecture, to compute the indicators with strict real-time requirements.
Although the Spark and Flink engines met customers' needs for real-time data presentation, Spark, despite its stream-batch design goal, essentially implements streaming on top of batch, which left some hard limitations in practice. Flink in the same period was not yet mature either, so the Data Stack team extended Flink's functionality in several ways.
Out of this work came FlinkX, which synchronizes more data sources, and FlinkStreamSql, which enables real-time SQL computation and writing for more data sources. (Taken from open source, given back to open source: the Data Stack team has shared both on GitHub; interested readers can follow the original link.)
At this stage, with the self-developed FlinkX and FlinkStreamSql, a new stream-computing link for real-time data analysis was added alongside the original offline link, completing the transformation from a traditional big-data architecture to the Lambda architecture.
The core idea of the Lambda architecture is to split the business: workloads with strict real-time requirements take the real-time computing path, workloads with loose requirements take the offline path, and a data service layer finally merges and summarizes all the data for downstream consumption.
Flow chart of independent stream and batch processing in the Lambda architecture
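The business split described above can be sketched in a few lines of Python (a toy model, not DTinsight code; metric names and numbers are invented): the batch layer owns everything up to the last daily cutoff, the speed layer accumulates only newer events, and the serving layer merges the two views when answering a query.

```python
# Batch layer: a daily job produces a complete view up to the last cutoff.
batch_view = {"page_views:2022-06-21": 1200, "page_views:2022-06-22": 1350}

# Speed layer: a streaming job accumulates only events after the cutoff.
speed_view = {}
def on_event(day: str, n: int = 1) -> None:
    key = f"page_views:{day}"
    speed_view[key] = speed_view.get(key, 0) + n

for _ in range(37):          # 37 page views arrive today
    on_event("2022-06-23")

# Serving layer: downstream queries see batch + speed merged into one answer.
def query(day: str) -> int:
    key = f"page_views:{day}"
    return batch_view.get(key, 0) + speed_view.get(key, 0)

assert query("2022-06-22") == 1350   # fully covered by the batch layer
assert query("2022-06-23") == 37     # covered so far by the speed layer only
```

The sketch also makes the architecture's cost visible: the counting logic exists twice (once in the batch job, once in `on_event`), and the two copies must stay in agreement, which is exactly the maintenance burden discussed below.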
3. Real-time processing on the Kappa architecture
The Lambda architecture largely satisfied customers' demands for real-time data. A large number of customers used the Data Stack's DTinsight to power production tasks, with tens of thousands of tasks running every day, and DTinsight kept operating stably, providing strong backing for customers' data-driven business.
Although Lambda met customers' real-time business needs, business volume grew as the enterprises developed, and development and operations costs grew with it. Meanwhile Flink's stream-processing technology was maturing: Flink's exactly-once semantics and stateful computation fully guarantee the accuracy of final results. The Data Stack team therefore began to study how the Lambda architecture could be adjusted.
Jay Kreps, former principal engineer at LinkedIn, had proposed an improvement to Lambda: enhance the Speed Layer so that, in addition to processing real-time data, it can also reprocess previously handled historical data when the business logic is updated.
Inspired by Kreps, the Data Stack team recommends that customers with mostly real-time workloads extend the retention period of their Kafka data logs. When the code of a streaming task changes or upstream data must be backtracked, the original Job N is left running untouched while a new Job N+1 is started from a specified historical offset, writing its results into a new table N+1. Once Job N+1's progress has caught up with Job N's, Job N is replaced by Job N+1, and downstream processes simply switch to analyzing or displaying the tables produced by Job N+1. The offline link layer can then be removed entirely, reducing the extra code customers must develop and maintain while unifying the computation logic across the business.
The drawback of the Lambda architecture is precisely the need to maintain code that must produce identical results in two complex distributed systems. Reprocessing, by raising parallelism and replaying historical data, can effectively replace the offline processing system. The resulting architecture is simpler and avoids the problem of maintaining two code bases that must stay consistent.
Flow chart of a real-time data warehouse on the Kappa architecture
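The Job N / Job N+1 switchover can be simulated in miniature (a Python sketch under simplifying assumptions: a list stands in for a long-retention Kafka topic, a dict for the output table, and the "logic change" is invented for illustration):

```python
# An append-only log stands in for a Kafka topic with long retention.
log = [("k1", 1), ("k2", 2), ("k1", 3), ("k2", 4)]

def run_job(records, logic):
    """A 'job' folds log records into an output table."""
    table = {}
    for key, value in records:
        table[key] = logic(table.get(key, 0), value)
    return table

# Job N has been running with the old business logic (sum of values).
table_n = run_job(log, lambda acc, v: acc + v)

# The logic changes (say, keep the max instead). Rather than patching Job N
# in place, Job N+1 replays the log from offset 0 into a brand-new table N+1.
table_n_plus_1 = run_job(log[0:], lambda acc, v: max(acc, v))

# Once Job N+1 has caught up, downstream simply switches to table N+1 and
# Job N is retired. No separate offline link is needed for the backfill.
assert table_n == {"k1": 4, "k2": 6}
assert table_n_plus_1 == {"k1": 3, "k2": 4}
```

Because Job N keeps serving reads until the replay catches up, downstream consumers never see a half-rebuilt table; the switch is a single atomic change of which table they query.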
4. A stream-batch unified warehouse on a hybrid Kappa + Lambda architecture
With the Lambda and Kappa architectures, the Data Stack could cover the real-time scenarios and the development and operations needs of most enterprises. However, some businesses with very strict real-time requirements see inaccurate real-time results under extreme data disorder, and the streaming task then faces a data-quality problem.
For this case, the Data Stack team combined the strengths of the Kappa and Lambda architectures: the offline link of the Lambda architecture periodically corrects the output data of the real-time link, while the FlinkX kernel's support for both stream and batch lets the compute layer run every task in the pipeline on the FlinkX engine, guaranteeing eventual consistency of the data.
III. A Technical Look at FlinkX, the Data Stack's Core Stream-Batch Engine
FlinkX is a Flink-based, stream-batch unified data synchronization and SQL computation tool. It can collect static data, such as business data in MySQL or HDFS, as well as continuously changing data, such as MySQL binlogs or Kafka. FlinkX 1.12 merges in FlinkStreamSql, so that FlinkX 1.12 can ingest both static and dynamic data through synchronization tasks, and can then process the collected data in stream or batch mode via SQL tasks, according to the timeliness the business requires.
Within the Data Stack, FlinkX realizes stream-batch unification at two layers: data acquisition and data computation.
1. Data acquisition layer
In terms of data tense, data divides into real-time data and offline data. High-throughput message middleware such as Kafka or EMQ typically carries a continuous stream of data, so a FlinkX real-time collection task can land it into storage immediately, allowing downstream tasks to run near-real-time or quasi-real-time business computations. OLTP databases such as MySQL or Oracle typically hold historical transactional data that is stored and computed by day or by month, so a FlinkX offline synchronization task can sync this data, incrementally or in full, into an OLAP warehouse or data lake, where it is layered and analyzed in batches according to the various business indicators.
In addition, beyond landing data in the storage layer, FlinkX synchronization tasks also clean and transform the data and complete its dimensions, following the data standards defined in data governance together with the warehouse specification, improving data validity and the accuracy of business computations.
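The incremental-versus-full distinction mentioned above can be sketched as follows (plain Python, not FlinkX configuration; the watermark column and sample rows are invented for illustration):

```python
# Source table rows: (id, updated_at, payload). updated_at serves as the
# watermark column an incremental sync filters on.
source = [
    (1, "2022-06-20", "a"),
    (2, "2022-06-21", "b"),
    (3, "2022-06-22", "c"),
]

def full_sync(rows):
    """Full mode: copy every row, e.g. for the first load of a table."""
    return list(rows)

def incremental_sync(rows, last_watermark):
    """Incremental mode: copy only rows changed after the last sync,
    then advance the watermark for the next scheduled run."""
    changed = [r for r in rows if r[1] > last_watermark]
    new_watermark = max((r[1] for r in changed), default=last_watermark)
    return changed, new_watermark

warehouse = full_sync(source)                       # initial load: 3 rows
delta, wm = incremental_sync(source, "2022-06-21")  # nightly run picks up row 3
assert [r[0] for r in delta] == [3] and wm == "2022-06-22"
```

A real sync tool must also handle deletes and in-place updates, which a pure watermark filter cannot see; that is one reason log-based (binlog) collection exists alongside the offline mode.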
2. Data computing layer
Once data has been collected into the designated storage layer, routine business computations run on it according to storage type and business timeliness. FlinkX SQL's ability to compute in both stream and batch mode comes from the Flink 1.12 kernel, which unifies metadata management and supports batch execution mode on the DataStream API. This enhances job reusability and maintainability: a FlinkX job can switch freely between stream and batch execution modes while maintaining only one set of code, with no rewriting. Moreover, compared with open-source Flink, FlinkX 1.12 not only provides more sources and sinks to support real-time and offline computation over many data stores, it also implements a dirty-data management plugin, so that erroneous, non-conforming records in the ETL stage can be detected and tolerated.
Flow chart of how FlinkX realizes stream-batch unification in the Data Stack
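The dirty-data idea, tolerating bad records instead of failing the job, can be shown with a small sketch (Python only; the FlinkX plugin's actual interface is not documented here, and the record shapes are invented):

```python
# ETL transform: parse an amount field. Bad records must not kill the job;
# they are diverted to a dirty-data store for later inspection.
def run_etl(records):
    clean, dirty = [], []
    for record in records:
        try:
            clean.append({"id": record["id"], "amount": float(record["amount"])})
        except (KeyError, TypeError, ValueError) as err:
            # Perception: keep the offending record and the reason,
            # rather than letting one bad row abort the whole task.
            dirty.append({"record": record, "error": repr(err)})
    return clean, dirty

records = [
    {"id": 1, "amount": "19.90"},
    {"id": 2, "amount": "oops"},      # non-conforming value
    {"id": 3},                        # missing field
]
clean, dirty = run_etl(records)
assert [r["id"] for r in clean] == [1] and len(dirty) == 2
```

The dirty-data store then becomes a queryable artifact of its own: operators can inspect it, fix upstream issues, and replay the quarantined rows.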
3. Stream-batch unification in practice in the Data Stack warehouse
The following walks through stream-batch unification in the Data Stack, using a scenario that matches the architecture diagram.
Scenario: in a stock exchange, K-line charts include intraday (time-sharing) charts, daily charts, weekly charts, and so on. After a user's trade completes, the buy/sell point and the transaction amount must be displayed on the K-line chart.
Before the Data Stack implemented stream-batch unification:
For the above scenario, before unification, the buy/sell points of the intraday chart were computed with Flink, while daily-K, weekly-K, and similar charts were computed with periodically scheduled Spark tasks: a classic Lambda setup. The pain points of this architecture are obvious: maintaining two code bases is inefficient, two compute engines are costly, and the two paths disagree on computation logic.
After the Data Stack implemented stream-batch unification:
On the Data Stack platform, real-time collection and data synchronization tasks are first created to land the business database's data into Kafka and Iceberg, forming the warehouse's ODS layer. From ODS to DWD, the real-time warehouse and the offline warehouse share the same cleaning and widening logic and the same table definitions, so this step needs only one set of Flink SQL, which the platform automatically translates into a Flink Stream task and a Flink Batch task usable by both warehouses. At the DWS layer, the real-time warehouse stores intraday buy/sell point information while the offline warehouse stores daily-K, weekly-K, and similar data; since the processing logic differs, two sets of SQL must be developed here according to the business: the Stream Flink SQL reads the real-time warehouse's DWD layer and computes intraday buy/sell points in real time, while the Batch Flink SQL reads the offline warehouse's DWD layer on a schedule and computes daily-K, weekly-K, and other buy/sell data. Application-layer services fetch the buy/sell data directly from the DWS layer for display.
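What the batch-side "daily K" boils down to is an open/high/low/close aggregation per day over the trade records. A minimal sketch (Python, with invented sample trades; the real DWS job would be Flink SQL over DWD tables):

```python
from collections import OrderedDict

# Trades: (timestamp, price), assumed already in time order.
trades = [
    ("2022-06-22 09:30", 10.0), ("2022-06-22 11:00", 10.8),
    ("2022-06-22 14:55", 10.4),
    ("2022-06-23 09:31", 10.5), ("2022-06-23 15:00", 10.1),
]

def daily_k(trades):
    """Fold time-ordered trades into one OHLC bar per day."""
    days = OrderedDict()
    for ts, price in trades:
        day = ts[:10]
        if day not in days:
            days[day] = {"open": price, "high": price, "low": price, "close": price}
        else:
            bar = days[day]
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price      # last trade seen for the day
    return days

bars = daily_k(trades)
assert bars["2022-06-22"] == {"open": 10.0, "high": 10.8, "low": 10.0, "close": 10.4}
assert bars["2022-06-23"]["close"] == 10.1
```

Weekly K is the same fold with a coarser key (the week instead of the day), which is why the batch side can share so much structure across chart granularities.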
As the example shows, the Data Stack chose Iceberg as the stream-batch unified storage layer, for the following reasons:
1. Iceberg stores raw data, so data structures can be diverse;
2. Iceberg supports multiple compute models; it is a general-purpose Table Format that cleanly decouples the compute engine from the underlying storage system;
3. Iceberg is flexible about underlying storage, usually cheap distributed file systems such as S3, OSS, or HDFS; with a suitable file format and caching it meets the scenario's data-analysis needs;
4. The community resources behind Iceberg are rich, with many large companies at home and abroad running massive data volumes on Iceberg;
5. Iceberg keeps the full data, so when a streaming task needs to rerun historical data it can read from Iceberg first and then switch seamlessly over to Kafka.
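Point 5 describes a bootstrap-then-tail pattern: replay full history from the table, then continue from the log, counting any overlap only once. A toy model (Python; a list stands in for the Iceberg table, another for the Kafka tail, and day-level dedup is a simplifying assumption):

```python
# Historical records live in a table keeping full data (Iceberg's role);
# fresh records keep arriving on a log (Kafka's role). A rerun of a stream
# task bootstraps from the table, then tails the log past the switchover.
table = [("2022-06-20", 5), ("2022-06-21", 7), ("2022-06-22", 4)]   # full history
stream = [("2022-06-22", 4), ("2022-06-23", 9)]                     # recent tail

def rerun(table, stream):
    seen_days = set()
    total = 0
    for day, n in table:        # phase 1: replay full history from the table
        total += n
        seen_days.add(day)
    for day, n in stream:       # phase 2: switch to the log, skipping overlap
        if day in seen_days:
            continue
        total += n
    return total

assert rerun(table, stream) == 25   # 5 + 7 + 4 + 9, the overlap counted once
```

In production the hand-off point would be a snapshot/offset pair rather than a day key, but the shape of the rerun is the same: history from cheap storage, freshness from the log.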
IV. How Stream-Batch Unification Empowers the Enterprise
As big data keeps developing, enterprises' business scenarios have moved from being satisfied with offline processing to demanding high real-time performance. The Data Stack's products have kept iterating and upgrading through this process, improving the quality of enterprises' computed results, raising the R&D efficiency of enterprise business, and providing powerful help in reducing enterprises' maintenance costs.
1. Improve the quality of data calculation results
High-quality, high-accuracy data helps enterprises make sound decisions. By unifying the compute engine in its hybrid-architecture stream-batch warehouse, the Data Stack eliminates the problem of SQL logic that cannot be reused between the two engines' code bases, guaranteeing data consistency and quality.
2. Improve the R&D efficiency of enterprise business
From development to launch, business developers need to write only one set of SQL tasks per business, then switch flexibly between stream and batch computation according to the service's latency requirements. Application-side developers likewise need to build only one set of encapsulation logic over the business SQL.
3. Improve enterprise resource utilization and reduce maintenance costs
For enterprise users, real-time and offline workloads need only run on one set of compute engines, so there is no need to purchase high-spec hardware for separate real-time and offline engines. When the business changes, developers modify only the corresponding SQL task, with no need to change real-time and offline tasks separately.
V. Future Plans
Although FlinkX SQL has improved stream-batch computation to a degree, the timeliness of batch processing still needs work. Next, the Data Stack team will optimize from the Flink source level down to the operators and Tasks, raising batch-level compute efficiency and cutting enterprises' time costs. It will also further unify metadata standards across data sources, so that the data dictionary, data lineage, data quality, permission management, and the other modules involved in enterprise data governance can respond quickly at the usage level, reducing enterprise management costs.
Through iteration, the Data Stack's stream-batch unified architecture has realized the combination of a real-time warehouse with OLAP scenarios: one set of code serves multiple processing modes, meeting enterprises' low-latency, timeliness-driven business requirements while reducing development, operations, and labor costs. Of course, this is only the first step in the exploration of stream-batch unification. The Data Stack team will continue to dig into the storage layer, combining the warehouse's ease of management and high data quality with the data lake's explorability and flexibility, completing the Data Stack's transformation from warehouse to lakehouse and realizing unified storage of raw data ahead of flexible exploration, going a step further at the data-architecture level.