Lake-warehouse integrated e-commerce project (I): introduction to the project background and architecture
2022-07-24 09:32:00 【Lansonli】

Table of contents
Introduction to project background and architecture
I. Project background
II. Project architecture
1. The current state of real-time data warehouses
2. Project architecture and data layering
3. Project visualization
Introduction to project background and architecture
I. Project background
The lake-warehouse integrated (lakehouse) real-time e-commerce project is an e-commerce data analysis platform modeled on the business of a Taobao-style online mall. On the technical side it covers building the big data components, designing the layered lakehouse warehouse, computing real-time and offline data metrics, and visualizing the results on a data dashboard. The components used in the project are introduced from the ground up, with the goal of merging the data warehouse and the data lake into a single lakehouse architecture and delivering enterprise-grade offline and real-time metric analysis. On the business side, the project currently focuses on the member and commodity subject areas; the analysis metrics include real-time user login analysis, real-time browsing PV/UV analysis, real-time product browsing analysis, and user score analysis. More business metrics and architecture improvements will be added over time.
II. Project architecture
1. The current state of real-time data warehouses
Offline data warehouses based on Hive are very mature today. As real-time compute engines keep evolving and the business demand for real-time reporting keeps growing, the industry has in recent years been focused on building real-time data warehouses. In the evolution of data warehouse architectures, the Lambda architecture contains two links, an offline processing link and a real-time processing link, as shown below:
[Figure: Lambda architecture, with offline and real-time processing links]
It is precisely because these two links process the same data separately that problems such as data inconsistency arise, which gave rise to the Kappa architecture. The Kappa architecture looks like this:
[Figure: Kappa architecture]
The Kappa architecture can be called a real-time data warehouse, and the most common implementation in the industry today is Flink + Kafka. However, a real-time warehouse built on Kafka + Flink has several obvious shortcomings, so many enterprises adopt a hybrid architecture for their real-time warehouse instead of implementing every business line with Kappa-style real-time processing. The main defects of the Kappa architecture are:
- Kafka cannot store massive amounts of data. For business lines with very large data volumes, Kafka can usually retain data for only a short time, such as the most recent week or even just the most recent day.
- Kafka cannot serve efficient OLAP queries. Most businesses want to run ad hoc queries against the DWD/DWS layers, but Kafka is not well suited to supporting that kind of workload.
- The mature data-lineage and data-quality management systems built around the offline warehouse cannot be reused; a new lineage and data-quality system has to be implemented from scratch.
- Kafka does not support update/upsert; it currently supports only append. In real scenarios the DWS light-aggregation layer needs frequent updates: data from the DWD detail layer is usually aggregated into DWS by time granularity and dimension to reduce data volume and improve query performance. If the raw data is at second-level granularity and the aggregation window is one minute, some late-arriving records may need to update results that the time window has already emitted, and this kind of update cannot be implemented on Kafka (see the sketch below).
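To make the last point concrete, here is a minimal sketch of a one-minute DWS aggregation written with the Flink Table API. The table name, fields, Kafka topic, and broker address are illustrative assumptions, not the project's actual schema. Once such a window result has been written to an append-only Kafka topic, a late-arriving detail record has no way to revise the emitted row:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class DwsMinuteAggSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Hypothetical DWD detail table fed from Kafka: second-level browsing events.
        tEnv.executeSql(
            "CREATE TABLE dwd_user_click (" +
            "  user_id STRING," +
            "  page_id STRING," +
            "  ts TIMESTAMP(3)," +
            "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'dwd_user_click'," +
            "  'properties.bootstrap.servers' = 'node1:9092'," +
            "  'properties.group.id' = 'dws_agg_demo'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'json')");

        // One-minute light aggregation for the DWS layer. A record that arrives after its
        // window has fired would need to UPDATE the already-emitted row, which an
        // append-only Kafka DWS layer cannot express.
        tEnv.executeSql(
            "SELECT TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start, page_id," +
            "       COUNT(*) AS pv, COUNT(DISTINCT user_id) AS uv " +
            "FROM dwd_user_click " +
            "GROUP BY TUMBLE(ts, INTERVAL '1' MINUTE), page_id").print();
    }
}
```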
So the real-time warehouse has evolved to the architecture above, which to some extent solves the timeliness problem of data reports, but many problems remain. Besides the issues listed above, companies with heavier real-time requirements that adopt the Kappa architecture still cannot avoid scenarios where offline and real-time data must be computed together; for those cases they often have to write one-off real-time programs that re-read the Kafka data of a particular layer, which is very inconvenient.
The emergence of data lake technology makes it possible for the Kappa architecture to compute batch data and real-time data in a unified way. This is the "batch-stream unification" we hear about today. Some people in the industry consider batch and streaming unified when they are expressed with the same SQL at the development level; others consider them unified when batch and stream processing share the same compute engine, for example the Spark / Spark Streaming / Structured Streaming and Flink frameworks, which unify batch and stream processing at the engine level (a minimal sketch of engine-level unification follows).
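As a minimal illustration of engine-level unification (the input elements are made up for the example), the same Flink DataStream program can run as an unbounded streaming job or as a bounded batch job simply by switching the runtime mode:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchStreamUnifiedSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Switch between BATCH and STREAMING without changing the business logic:
        // this is batch-stream unification at the compute-engine level.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);   // or RuntimeExecutionMode.STREAMING

        env.fromElements(
                Tuple2.of("user_login", 1),
                Tuple2.of("page_view", 1),
                Tuple2.of("user_login", 1))
            .keyBy(t -> t.f0)   // group by event type
            .sum(1)             // count events per type
            .print();

        env.execute("batch-stream unified counting");
    }
}
```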
Unifying batch and streaming in business SQL or unifying them in the compute engine is only one aspect of batch-stream unification. The other core aspect is unifying the storage. Data lake technology can store batch and real-time data in one place and process them with the same computation, so we can merge the storage of the offline warehouse and the real-time warehouse into the data lake, replacing the Kafka storage used for warehouse layering in the Kappa architecture with data lake storage. This is how the "lakehouse" (lake-warehouse integrated) architecture is built.
Building a lakehouse architecture is also how major companies currently unify the processing and computation of offline and real-time scenarios. For example, if a large company uses Iceberg as the storage layer, many problems of the Kappa architecture can be solved, and the Kappa architecture becomes the following:
[Figure: Kappa architecture with the Kafka layers replaced by Iceberg]
In this architecture, data for both stream processing and batch processing is stored uniformly on the data lake (Iceberg). Unifying the storage in this way resolves many of the Kappa architecture's pain points:
- It solves the problem that Kafka can store only a small amount of data. Today's data lakes are essentially file-management systems built on top of HDFS, so the data volume can be very large.
- The DW-layer data can still serve OLAP queries. Again because the data lake is implemented on top of HDFS, existing OLAP query engines only need some adaptation to query it.
- Once batch and stream data are stored on Iceberg/HDFS, the same data-lineage and data-quality management system can be reused.
- Real-time data can be updated (see the sketch below for an upsert-enabled Iceberg table).
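As a minimal sketch of that last point, Flink SQL can register an Iceberg catalog and create a format-version-2 table with upsert enabled, so the DWS layer on the lake can be updated in place rather than being append-only. The catalog name, HDFS warehouse path, and table schema below are assumptions for illustration:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class IcebergUpsertTableSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Register a Hadoop-catalog-backed Iceberg catalog (warehouse path is an assumption).
        tEnv.executeSql(
            "CREATE CATALOG hadoop_iceberg WITH (" +
            "  'type' = 'iceberg'," +
            "  'catalog-type' = 'hadoop'," +
            "  'warehouse' = 'hdfs://mycluster/lakehousedata')");

        tEnv.executeSql("CREATE DATABASE IF NOT EXISTS hadoop_iceberg.icebergdb");

        // Iceberg v2 table with upsert enabled: rows sharing the primary key are updated,
        // which is exactly what a Kafka-based DWS layer could not do.
        tEnv.executeSql(
            "CREATE TABLE IF NOT EXISTS hadoop_iceberg.icebergdb.dws_user_login_agg (" +
            "  dt STRING," +
            "  user_id STRING," +
            "  login_cnt BIGINT," +
            "  PRIMARY KEY (dt, user_id) NOT ENFORCED" +
            ") WITH (" +
            "  'format-version' = '2'," +
            "  'write.upsert.enabled' = 'true')");
    }
}
```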
The architecture above can also be regarded as a variant of the Kappa architecture. There are still two data links: an offline link based on Spark and a real-time link based on Flink. Normally data is processed directly through the real-time link, while the offline link is used mainly for unconventional scenarios such as data correction. Such an architecture is a real-time data warehouse solution that can actually be implemented and can deliver real-time reports. A sketch of the offline correction link follows.
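For the offline link, a hedged sketch of a data-correction job in Spark might look like the following. The catalog configuration mirrors the hypothetical one above, and the DWD/DWS table names are assumed for illustration; the Iceberg Spark runtime and SQL extensions must be on the classpath for MERGE INTO to work:

```java
import org.apache.spark.sql.SparkSession;

public class OfflineCorrectionSketch {
    public static void main(String[] args) {
        // Point Spark at the same (assumed) Iceberg warehouse the Flink jobs write to.
        SparkSession spark = SparkSession.builder()
            .appName("iceberg-offline-correction")
            .config("spark.sql.extensions",
                    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
            .config("spark.sql.catalog.hadoop_iceberg", "org.apache.iceberg.spark.SparkCatalog")
            .config("spark.sql.catalog.hadoop_iceberg.type", "hadoop")
            .config("spark.sql.catalog.hadoop_iceberg.warehouse", "hdfs://mycluster/lakehousedata")
            .getOrCreate();

        // Offline correction: recompute one day's DWS aggregate from the detail layer and
        // merge it over whatever the real-time link produced (e.g. after late data arrived).
        spark.sql(
            "MERGE INTO hadoop_iceberg.icebergdb.dws_user_login_agg t " +
            "USING (SELECT dt, user_id, COUNT(*) AS login_cnt " +
            "       FROM hadoop_iceberg.icebergdb.dwd_user_login " +
            "       WHERE dt = '2022-07-23' GROUP BY dt, user_id) s " +
            "ON t.dt = s.dt AND t.user_id = s.user_id " +
            "WHEN MATCHED THEN UPDATE SET t.login_cnt = s.login_cnt " +
            "WHEN NOT MATCHED THEN INSERT *");

        spark.stop();
    }
}
```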
2. Project architecture and data layering
The data lake technology used in this project is Iceberg; we build a lakehouse architecture on it to analyze the e-commerce business metrics both in real time and offline. The overall project architecture is shown below:
[Figure: overall project architecture]
There are two kinds of data sources in the project: business data from the MySQL database and user log data. Both types of data are first collected into their corresponding Kafka topics, and Flink then writes the business and log data into the Iceberg-ODS layer. Because Flink's current Iceberg integration does not preserve the consumption position of the data well, the raw data is also kept in Kafka; by consuming Kafka with Flink, which maintains the offsets automatically, we guarantee that data is consumed correctly after the job stops and restarts. A minimal sketch of loading one Kafka topic into the Iceberg-ODS layer follows.
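Below is a minimal sketch of one such ODS loading job (topic, field, and host names are placeholders, and the Iceberg catalog is the assumed one from the earlier sketch). Note that the Flink Iceberg sink only commits data files when a checkpoint completes, so checkpointing must be enabled:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class OdsUserLogLoaderSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // The Iceberg sink commits data files on checkpoints.
        env.enableCheckpointing(60_000);
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Kafka source: the consumer group id plus checkpointed offsets let the job
        // resume from the right position after a stop/restart.
        tEnv.executeSql(
            "CREATE TABLE kafka_user_log (" +
            "  user_id STRING," +
            "  page_id STRING," +
            "  action STRING," +
            "  log_time BIGINT" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'KAFKA-USER-LOG-DATA'," +
            "  'properties.bootstrap.servers' = 'node1:9092'," +
            "  'properties.group.id' = 'ods_user_log_loader'," +
            "  'scan.startup.mode' = 'group-offsets'," +
            "  'properties.auto.offset.reset' = 'earliest'," +
            "  'format' = 'json')");

        tEnv.executeSql(
            "CREATE CATALOG hadoop_iceberg WITH (" +
            "  'type' = 'iceberg'," +
            "  'catalog-type' = 'hadoop'," +
            "  'warehouse' = 'hdfs://mycluster/lakehousedata')");

        tEnv.executeSql(
            "CREATE TABLE IF NOT EXISTS hadoop_iceberg.icebergdb.ods_user_log (" +
            "  user_id STRING, page_id STRING, action STRING, log_time BIGINT)");

        // Continuously land the raw log stream into the Iceberg-ODS layer.
        tEnv.executeSql(
            "INSERT INTO hadoop_iceberg.icebergdb.ods_user_log " +
            "SELECT user_id, page_id, action, log_time FROM kafka_user_log");
    }
}
```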
The whole warehouse hierarchy is built on Iceberg. Data processed from Kafka is stored into the corresponding Iceberg layers. The final real-time analysis results are written to ClickHouse, while the offline analysis reads directly from the Iceberg-DWS layer and stores its results in MySQL; the other Iceberg layers serve temporary (ad hoc) business analysis. The results in ClickHouse and MySQL are finally displayed with visualization tools. A sketch of an offline DWS-to-MySQL job follows.
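As a final sketch, an offline metric job could read the Iceberg-DWS layer in batch mode and publish the result to MySQL through Flink's JDBC connector; the database, table, and credential values below are placeholders:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class OfflineDwsToMysqlSketch {
    public static void main(String[] args) {
        // Bounded (batch) execution over the lake data.
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());

        tEnv.executeSql(
            "CREATE CATALOG hadoop_iceberg WITH (" +
            "  'type' = 'iceberg'," +
            "  'catalog-type' = 'hadoop'," +
            "  'warehouse' = 'hdfs://mycluster/lakehousedata')");

        // MySQL result table via the Flink JDBC connector (connection details are placeholders).
        tEnv.executeSql(
            "CREATE TABLE mysql_user_login_days (" +
            "  dt STRING," +
            "  login_users BIGINT" +
            ") WITH (" +
            "  'connector' = 'jdbc'," +
            "  'url' = 'jdbc:mysql://node2:3306/lakehouse_result'," +
            "  'table-name' = 'user_login_days'," +
            "  'username' = 'root'," +
            "  'password' = '123456')");

        // Aggregate the DWS layer offline and publish the result for the dashboards.
        tEnv.executeSql(
            "INSERT INTO mysql_user_login_days " +
            "SELECT dt, COUNT(DISTINCT user_id) AS login_users " +
            "FROM hadoop_iceberg.icebergdb.dws_user_login_agg " +
            "GROUP BY dt");
    }
}
```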
3. Project visualization

- Blog home page: https://lansonli.blog.csdn.net
- This article was originally written by Lansonli and first published on the CSDN blog