Lake-warehouse integrated e-commerce project (I): introduction to the project background and architecture
2022-07-24 09:32:00 【Lansonli】

Table of contents
Introduction to project background and architecture
I. Project background
II. Project architecture
1. The current state of real-time data warehouses
2. Project architecture and data layering
3. Project visualization
Introduction to project background and architecture
I. Project background
The lake-warehouse integrated (lakehouse) real-time e-commerce project is an e-commerce data analysis platform modeled on the business of a Taobao-style online mall. On the technical side it covers building the big data components, designing the layered lakehouse warehouse, computing real-time and offline data metrics, and visualizing the results on a data dashboard. The components used in the project are introduced from the ground up, with the goal of merging the data warehouse and the data lake into a single lakehouse architecture and delivering enterprise-grade offline and real-time metric analysis. On the business side, the project currently focuses on the member and commodity subject areas; the analysis metrics include real-time user login analysis, real-time browsing PV/UV analysis, real-time product browsing analysis, and user score analysis. More business metrics and architecture improvements will be added over time.
II. Project architecture
1. The current state of real-time data warehouses
Offline data warehouses based on Hive are very mature today. As real-time compute engines keep evolving and the business demand for real-time reporting keeps growing, the industry has in recent years been focused on building real-time data warehouses. In the evolution of data warehouse architectures, the Lambda architecture contains two links, an offline processing link and a real-time processing link, as shown below:
[Figure: Lambda architecture, with offline and real-time processing links]
It is precisely because these two links process the same data separately that problems such as data inconsistency arise, which gave rise to the Kappa architecture. The Kappa architecture looks like this:
[Figure: Kappa architecture]
The Kappa architecture can be called a real-time data warehouse, and the most common implementation in the industry today is Flink + Kafka. However, a real-time warehouse built on Kafka + Flink has several obvious shortcomings, so many enterprises adopt a hybrid architecture for their real-time warehouse instead of implementing every business line with Kappa-style real-time processing. The main defects of the Kappa architecture are:
- Kafka cannot store massive amounts of data. For business lines with very large data volumes, Kafka can usually retain data for only a short time, such as the most recent week or even just the most recent day.
- Kafka cannot serve efficient OLAP queries. Most businesses want to run ad hoc queries against the DWD/DWS layers, but Kafka is not well suited to supporting that kind of workload.
- The mature data-lineage and data-quality management systems built around the offline warehouse cannot be reused; a new lineage and data-quality system has to be implemented from scratch.
- Kafka does not support update/upsert; it currently supports only append. In real scenarios the DWS light-aggregation layer needs frequent updates: data from the DWD detail layer is usually aggregated into DWS by time granularity and dimension to reduce data volume and improve query performance. If the raw data is at second-level granularity and the aggregation window is one minute, some late-arriving records may need to update results that the time window has already emitted, and this kind of update cannot be implemented on Kafka (see the sketch below).
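To make the last point concrete, here is a minimal sketch of a one-minute DWS aggregation written with the Flink Table API. The table name, fields, Kafka topic, and broker address are illustrative assumptions, not the project's actual schema. Once such a window result has been written to an append-only Kafka topic, a late-arriving detail record has no way to revise the emitted row:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class DwsMinuteAggSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Hypothetical DWD detail table fed from Kafka: second-level browsing events.
        tEnv.executeSql(
            "CREATE TABLE dwd_user_click (" +
            "  user_id STRING," +
            "  page_id STRING," +
            "  ts TIMESTAMP(3)," +
            "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'dwd_user_click'," +
            "  'properties.bootstrap.servers' = 'node1:9092'," +
            "  'properties.group.id' = 'dws_agg_demo'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'json')");

        // One-minute light aggregation for the DWS layer. A record that arrives after its
        // window has fired would need to UPDATE the already-emitted row, which an
        // append-only Kafka DWS layer cannot express.
        tEnv.executeSql(
            "SELECT TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start, page_id," +
            "       COUNT(*) AS pv, COUNT(DISTINCT user_id) AS uv " +
            "FROM dwd_user_click " +
            "GROUP BY TUMBLE(ts, INTERVAL '1' MINUTE), page_id").print();
    }
}
```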
So the real-time warehouse has evolved to the architecture above, which to some extent solves the timeliness problem of data reports, but many problems remain. Besides the issues listed above, companies with heavier real-time requirements that adopt the Kappa architecture still cannot avoid scenarios where offline and real-time data must be computed together; for those cases they often have to write one-off real-time programs that re-read the Kafka data of a particular layer, which is very inconvenient.
The emergence of data lake technology makes it possible for the Kappa architecture to compute batch data and real-time data in a unified way. This is the "batch-stream unification" we hear about today. Some people in the industry consider batch and streaming unified when they are expressed with the same SQL at the development level; others consider them unified when batch and stream processing share the same compute engine, for example the Spark / Spark Streaming / Structured Streaming and Flink frameworks, which unify batch and stream processing at the engine level (a minimal sketch of engine-level unification follows).
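As a minimal illustration of engine-level unification (the input elements are made up for the example), the same Flink DataStream program can run as an unbounded streaming job or as a bounded batch job simply by switching the runtime mode:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchStreamUnifiedSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Switch between BATCH and STREAMING without changing the business logic:
        // this is batch-stream unification at the compute-engine level.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);   // or RuntimeExecutionMode.STREAMING

        env.fromElements(
                Tuple2.of("user_login", 1),
                Tuple2.of("page_view", 1),
                Tuple2.of("user_login", 1))
            .keyBy(t -> t.f0)   // group by event type
            .sum(1)             // count events per type
            .print();

        env.execute("batch-stream unified counting");
    }
}
```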
Unifying batch and streaming in business SQL or unifying them in the compute engine is only one aspect of batch-stream unification. The other core aspect is unifying the storage. Data lake technology can store batch and real-time data in one place and process them with the same computation, so we can merge the storage of the offline warehouse and the real-time warehouse into the data lake, replacing the Kafka storage used for warehouse layering in the Kappa architecture with data lake storage. This is how the "lakehouse" (lake-warehouse integrated) architecture is built.
Building a lakehouse architecture is also how major companies currently unify the processing and computation of offline and real-time scenarios. For example, if a large company uses Iceberg as the storage layer, many problems of the Kappa architecture can be solved, and the Kappa architecture becomes the following:
[Figure: Kappa architecture with the Kafka layers replaced by Iceberg]
In this architecture, data for both stream processing and batch processing is stored uniformly on the data lake (Iceberg). Unifying the storage in this way resolves many of the Kappa architecture's pain points:
- It solves the problem that Kafka can store only a small amount of data. Today's data lakes are essentially file-management systems built on top of HDFS, so the data volume can be very large.
- The DW-layer data can still serve OLAP queries. Again because the data lake is implemented on top of HDFS, existing OLAP query engines only need some adaptation to query it.
- Once batch and stream data are stored on Iceberg/HDFS, the same data-lineage and data-quality management system can be reused.
- Real-time data can be updated (see the sketch below for an upsert-enabled Iceberg table).
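As a minimal sketch of that last point, Flink SQL can register an Iceberg catalog and create a format-version-2 table with upsert enabled, so the DWS layer on the lake can be updated in place rather than being append-only. The catalog name, HDFS warehouse path, and table schema below are assumptions for illustration:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class IcebergUpsertTableSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Register a Hadoop-catalog-backed Iceberg catalog (warehouse path is an assumption).
        tEnv.executeSql(
            "CREATE CATALOG hadoop_iceberg WITH (" +
            "  'type' = 'iceberg'," +
            "  'catalog-type' = 'hadoop'," +
            "  'warehouse' = 'hdfs://mycluster/lakehousedata')");

        tEnv.executeSql("CREATE DATABASE IF NOT EXISTS hadoop_iceberg.icebergdb");

        // Iceberg v2 table with upsert enabled: rows sharing the primary key are updated,
        // which is exactly what a Kafka-based DWS layer could not do.
        tEnv.executeSql(
            "CREATE TABLE IF NOT EXISTS hadoop_iceberg.icebergdb.dws_user_login_agg (" +
            "  dt STRING," +
            "  user_id STRING," +
            "  login_cnt BIGINT," +
            "  PRIMARY KEY (dt, user_id) NOT ENFORCED" +
            ") WITH (" +
            "  'format-version' = '2'," +
            "  'write.upsert.enabled' = 'true')");
    }
}
```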
The architecture above can also be regarded as a variant of the Kappa architecture. There are still two data links: an offline link based on Spark and a real-time link based on Flink. Normally data is processed directly through the real-time link, while the offline link is used mainly for unconventional scenarios such as data correction. Such an architecture is a real-time data warehouse solution that can actually be implemented and can deliver real-time reports. A sketch of the offline correction link follows.
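For the offline link, a hedged sketch of a data-correction job in Spark might look like the following. The catalog configuration mirrors the hypothetical one above, and the DWD/DWS table names are assumed for illustration; the Iceberg Spark runtime and SQL extensions must be on the classpath for MERGE INTO to work:

```java
import org.apache.spark.sql.SparkSession;

public class OfflineCorrectionSketch {
    public static void main(String[] args) {
        // Point Spark at the same (assumed) Iceberg warehouse the Flink jobs write to.
        SparkSession spark = SparkSession.builder()
            .appName("iceberg-offline-correction")
            .config("spark.sql.extensions",
                    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
            .config("spark.sql.catalog.hadoop_iceberg", "org.apache.iceberg.spark.SparkCatalog")
            .config("spark.sql.catalog.hadoop_iceberg.type", "hadoop")
            .config("spark.sql.catalog.hadoop_iceberg.warehouse", "hdfs://mycluster/lakehousedata")
            .getOrCreate();

        // Offline correction: recompute one day's DWS aggregate from the detail layer and
        // merge it over whatever the real-time link produced (e.g. after late data arrived).
        spark.sql(
            "MERGE INTO hadoop_iceberg.icebergdb.dws_user_login_agg t " +
            "USING (SELECT dt, user_id, COUNT(*) AS login_cnt " +
            "       FROM hadoop_iceberg.icebergdb.dwd_user_login " +
            "       WHERE dt = '2022-07-23' GROUP BY dt, user_id) s " +
            "ON t.dt = s.dt AND t.user_id = s.user_id " +
            "WHEN MATCHED THEN UPDATE SET t.login_cnt = s.login_cnt " +
            "WHEN NOT MATCHED THEN INSERT *");

        spark.stop();
    }
}
```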
2. Project architecture and data layering
The data lake technology used in this project is Iceberg; we build a lakehouse architecture on it to analyze the e-commerce business metrics both in real time and offline. The overall project architecture is shown below:
[Figure: overall project architecture]
There are two kinds of data sources in the project: business data from the MySQL database and user log data. Both types of data are first collected into their corresponding Kafka topics, and Flink then writes the business and log data into the Iceberg-ODS layer. Because Flink's current Iceberg integration does not preserve the consumption position of the data well, the raw data is also kept in Kafka; by consuming Kafka with Flink, which maintains the offsets automatically, we guarantee that data is consumed correctly after the job stops and restarts. A minimal sketch of loading one Kafka topic into the Iceberg-ODS layer follows.
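Below is a minimal sketch of one such ODS loading job (topic, field, and host names are placeholders, and the Iceberg catalog is the assumed one from the earlier sketch). Note that the Flink Iceberg sink only commits data files when a checkpoint completes, so checkpointing must be enabled:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class OdsUserLogLoaderSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // The Iceberg sink commits data files on checkpoints.
        env.enableCheckpointing(60_000);
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Kafka source: the consumer group id plus checkpointed offsets let the job
        // resume from the right position after a stop/restart.
        tEnv.executeSql(
            "CREATE TABLE kafka_user_log (" +
            "  user_id STRING," +
            "  page_id STRING," +
            "  action STRING," +
            "  log_time BIGINT" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'KAFKA-USER-LOG-DATA'," +
            "  'properties.bootstrap.servers' = 'node1:9092'," +
            "  'properties.group.id' = 'ods_user_log_loader'," +
            "  'scan.startup.mode' = 'group-offsets'," +
            "  'properties.auto.offset.reset' = 'earliest'," +
            "  'format' = 'json')");

        tEnv.executeSql(
            "CREATE CATALOG hadoop_iceberg WITH (" +
            "  'type' = 'iceberg'," +
            "  'catalog-type' = 'hadoop'," +
            "  'warehouse' = 'hdfs://mycluster/lakehousedata')");

        tEnv.executeSql(
            "CREATE TABLE IF NOT EXISTS hadoop_iceberg.icebergdb.ods_user_log (" +
            "  user_id STRING, page_id STRING, action STRING, log_time BIGINT)");

        // Continuously land the raw log stream into the Iceberg-ODS layer.
        tEnv.executeSql(
            "INSERT INTO hadoop_iceberg.icebergdb.ods_user_log " +
            "SELECT user_id, page_id, action, log_time FROM kafka_user_log");
    }
}
```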
The whole warehouse hierarchy is built on Iceberg. Data processed from Kafka is stored into the corresponding Iceberg layers. The final real-time analysis results are written to ClickHouse, while the offline analysis reads directly from the Iceberg-DWS layer and stores its results in MySQL; the other Iceberg layers serve temporary (ad hoc) business analysis. The results in ClickHouse and MySQL are finally displayed with visualization tools. A sketch of an offline DWS-to-MySQL job follows.
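As a final sketch, an offline metric job could read the Iceberg-DWS layer in batch mode and publish the result to MySQL through Flink's JDBC connector; the database, table, and credential values below are placeholders:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class OfflineDwsToMysqlSketch {
    public static void main(String[] args) {
        // Bounded (batch) execution over the lake data.
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());

        tEnv.executeSql(
            "CREATE CATALOG hadoop_iceberg WITH (" +
            "  'type' = 'iceberg'," +
            "  'catalog-type' = 'hadoop'," +
            "  'warehouse' = 'hdfs://mycluster/lakehousedata')");

        // MySQL result table via the Flink JDBC connector (connection details are placeholders).
        tEnv.executeSql(
            "CREATE TABLE mysql_user_login_days (" +
            "  dt STRING," +
            "  login_users BIGINT" +
            ") WITH (" +
            "  'connector' = 'jdbc'," +
            "  'url' = 'jdbc:mysql://node2:3306/lakehouse_result'," +
            "  'table-name' = 'user_login_days'," +
            "  'username' = 'root'," +
            "  'password' = '123456')");

        // Aggregate the DWS layer offline and publish the result for the dashboards.
        tEnv.executeSql(
            "INSERT INTO mysql_user_login_days " +
            "SELECT dt, COUNT(DISTINCT user_id) AS login_users " +
            "FROM hadoop_iceberg.icebergdb.dws_user_login_agg " +
            "GROUP BY dt");
    }
}
```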
3. Project visualization

- Blog home page: https://lansonli.blog.csdn.net
- This article was originally written by Lansonli and first published on the CSDN blog