
Introduction to data warehouse

2022-06-28 00:11:00 【Willow shallot】

Recently, many students have asked me what my internship involves and whether I can recommend a learning route. Each time I answer that I do data warehouse development, but many of them know little about data warehouses, so I wrote this post to introduce the topic. If you are interested in earlier posts, see the following articles:

link : Willow shallot hadoop The way
link : Willow shallot spark The way
link : Willow shallot flink The way
link : Liu XiaoCong's big data interview road

This article introduces, from a beginner's perspective, what a data warehouse is and the theory behind it. I will not cover anything tied to my company's business. You may feel a little confused after reading it, which is normal: theory only sticks with practice, and you will remember it much better once you have worked through it yourself.

List of articles

1. The generation of data warehouse
- 1.1 Database and data warehouse
2. The idea of data warehouse modeling
3. The idea of data warehouse layering
- 3.1 Theoretical basis of data warehouse layering
- 3.2 Why layering
4. Data warehouse development technology
5. Reference article

1. The generation of data warehouse

For an enterprise, most data lives in online business systems such as production management, sales, and inventory management systems. As the business runs, data is generated continuously and is usually stored in business databases such as MySQL or Oracle, where it supports the operation of those systems.
As time goes by, the business systems accumulate more and more data. When that data needs to be analyzed, queries have to join many tables, some tables are accessed very often and others rarely, and the business database can no longer meet the needs of data analysis. Here are some books I often read:
[Image: the author's books]

1.1 Database and data warehouse

Here it is worth explaining the difference between a database and a data warehouse. Database design generally uses the ER (entity-relationship) model and is strictly required to satisfy 3NF in order to minimize data redundancy, whereas a data warehouse does not need to satisfy 3NF strictly. Some readers may have forgotten what 3NF is, so here is a quick recap:

1NF: Every column of a table holds an indivisible, atomic data item.
2NF: Every non-key column depends on the whole primary key, not on just part of it (eliminates partial dependencies).
3NF: Every non-key column depends directly on the primary key, not indirectly through another column (eliminates transitive dependencies).
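
As a minimal sketch of what eliminating a transitive dependency looks like, the example below uses Python's built-in sqlite3 module. The tables and columns (employee_flat, employee, department) are hypothetical and exist only for illustration; they are not taken from any real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Not in 3NF: dept_name depends on dept_id, which depends on emp_id
# (emp_id -> dept_id -> dept_name is a transitive dependency).
conn.execute(
    "CREATE TABLE employee_flat ("
    " emp_id INTEGER PRIMARY KEY, emp_name TEXT, dept_id INTEGER, dept_name TEXT)"
)
conn.executemany(
    "INSERT INTO employee_flat VALUES (?, ?, ?, ?)",
    [(1, "Ann", 10, "Sales"), (2, "Bob", 10, "Sales"), (3, "Cid", 20, "Finance")],
)

# 3NF: move the department attributes into their own table.
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT)")
conn.execute(
    "CREATE TABLE employee ("
    " emp_id INTEGER PRIMARY KEY, emp_name TEXT,"
    " dept_id INTEGER REFERENCES department(dept_id))"
)
conn.execute(
    "INSERT INTO department SELECT DISTINCT dept_id, dept_name FROM employee_flat"
)
conn.execute(
    "INSERT INTO employee SELECT emp_id, emp_name, dept_id FROM employee_flat"
)

# Renaming a department now touches one row instead of one row per employee.
renamed = conn.execute(
    "UPDATE department SET dept_name = 'Global Sales' WHERE dept_id = 10"
).rowcount

rows = conn.execute(
    "SELECT e.emp_name, d.dept_name FROM employee e"
    " JOIN department d ON d.dept_id = e.dept_id ORDER BY e.emp_id"
).fetchall()
print(renamed, rows)
```

The price of the normalized design is the extra join at read time, which is one reason a data warehouse is allowed to relax 3NF.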

Next, we need to introduce two concepts: OLTP and OLAP. They represent two ways of processing data, which correspond to the purposes of storing data in a database and in a data warehouse, respectively.

Online transaction processing, OLTP (On-Line Transaction Processing): the day-to-day operation of an online database, typically querying and updating a single record or a small group of records. It mainly serves specific enterprise applications. In a transactional environment, people care about response time and about the security and integrity of the data.
Online analytical processing, OLAP (On-Line Analytical Processing): used for decision analysis by managers. Decision data is mostly historical, aggregated, or computed, so queries often need to access large amounts of historical data, and complex queries and multidimensional analysis must be supported. Analytical processing also often uses external data, which is not generated by the transaction processing systems but comes from other external sources.

Differences between OLTP and OLAP:

Feature	OLTP	OLAP
Characteristic	Operational processing	Informational processing
User	Clerks, DBAs, database professionals	Data analysts, project managers, product managers
Function	Day-to-day operations	Long-term information requirements, decision support
DB design	ER-based, application-oriented	Dimensional modeling, subject-oriented
Data	Current; guaranteed up to date	Historical; maintained over time
Summarization	Primitive, highly detailed	Summarized, consolidated
View	Detailed, flat relational	Summarized, multidimensional
Unit of work	Short, simple transactions	Complex queries
Access	Read/write	Mostly read
Focus	Data in	Information out
Operations	Index/hash on primary key	Lots of scans
Number of records accessed	Tens	Millions
Number of users	Thousands	Hundreds
DB size	100 MB to GB	100 GB to TB
Priority	High performance, high availability	High flexibility, end-user autonomy
Metric	Transaction throughput	Query throughput, response time

T is for Transaction: small amounts of data per query, high concurrency, highly random access, millisecond latency, ACID.
- MySQL, PostgreSQL, SQL Server, Oracle
A is for Analysis: large data volumes, low concurrency, batch processing, second-to-minute latency, no ACID requirement, aiming for "eventual consistency".
- Hive, HBase, Spark, ClickHouse
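
To make the contrast concrete, here is a small sketch of the two access patterns against the same table, using Python's built-in sqlite3 module. The orders table and its rows are made up for illustration; a real OLAP workload would run on an engine such as Hive or ClickHouse rather than SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY, customer TEXT, order_month TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "a", "2022-05", 10.0),
        (2, "b", "2022-05", 20.0),
        (3, "a", "2022-06", 30.0),
        (4, "c", "2022-06", 40.0),
    ],
)

# OLTP-style work: a short transaction that updates and reads one record by key.
with conn:
    conn.execute("UPDATE orders SET amount = amount + 5 WHERE order_id = 3")
one_order = conn.execute(
    "SELECT amount FROM orders WHERE order_id = 3"
).fetchone()

# OLAP-style work: scan all history and aggregate it along a dimension.
monthly_totals = conn.execute(
    "SELECT order_month, SUM(amount) FROM orders"
    " GROUP BY order_month ORDER BY order_month"
).fetchall()
print(one_order, monthly_totals)
```

The first query touches one row through the primary key, while the second reads every row, which matches the "index on key" versus "lots of scans" rows of the table above.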

2. The idea of data warehouse modeling

2.1 ER Model

The ER model, or entity-relationship model (Entity Relationship, ER), describes the enterprise's business in an entity-relationship-entity pattern and conforms to 3NF in terms of normalization theory. The 3NF used in a data warehouse differs from the 3NF used in an OLTP system: it is a subject-oriented abstraction from the perspective of the whole enterprise, rather than an abstraction of entity relationships for one specific business process. It has the following characteristics:

It requires a comprehensive understanding of the enterprise's business and data (which is hard to achieve).
The implementation cycle is very long (and the model is hard to modify).
It places high demands on the modeler's ability.

The starting point of building a data warehouse model with ER is data integration: data from each system is combined and merged by subject from the perspective of the whole enterprise, and made consistent, in order to serve data analysis and decision making. The resulting model, however, cannot be used directly for analysis and decisions.
The modeling process has three stages:

High-level model: a highly abstract model that describes the main subjects and the relationships between them, giving an overall picture of the enterprise's business.
Mid-level model: refines the data items of each subject on the basis of the high-level model.
Physical model (also called the low-level model): builds on the mid-level model by taking physical storage into account, designing physical properties according to performance and platform characteristics, and possibly merging tables or designing partitions.

Summary: the ER model resembles the system design process of an information management system. It requires a comprehensive understanding of the business and has a long implementation cycle, which makes it a poor fit for today's fast-moving enterprises whose business changes frequently.

2.2 Dimension model

Dimensional modeling is the most popular modeling method in data warehouse engineering. It builds the model starting from the needs of analysis and decision making and serves those analytical needs, so it focuses on letting users complete their analysis faster while giving good response performance for large, complex queries. Its typical representative is the star model, along with the snowflake model used in some special scenarios. Below are the star, snowflake, and constellation models of dimensional modeling.

Star model: consists of one fact table and several denormalized dimension tables. It can be implemented with an ordinary relational database structure. The fact table is the core of the model, and the dimension tables surround it.


Snowflake model: an extension of the star model in which a dimension table can itself link out to several more detailed category tables.


Constellation model: also an extension of the star model. The difference is that a constellation model has several fact tables that share dimension tables. It is often used in scenarios with more complex data relationships, and is also called a galaxy model.
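
The star model can be sketched in a few lines with Python's built-in sqlite3 module. The table names (fact_sales, dim_date, dim_product) and the sample rows are hypothetical and chosen only to show the shape of the schema and of a typical query against it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive attributes, deliberately denormalized.
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);

    -- Fact table: foreign keys to the dimensions plus numeric measures.
    CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity INTEGER,
        amount REAL
    );

    INSERT INTO dim_date VALUES (20220601, '2022-06'), (20220701, '2022-07');
    INSERT INTO dim_product VALUES (1, 'pen', 'stationery'), (2, 'desk', 'furniture');
    INSERT INTO fact_sales VALUES
        (20220601, 1, 3, 6.0),
        (20220601, 2, 1, 80.0),
        (20220701, 1, 5, 10.0);
    """
)

# A typical star query: join the fact table to its dimensions, then group by
# dimension attributes and aggregate the measures.
report = conn.execute(
    """
    SELECT d.month, p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_key = f.date_key
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY d.month, p.category
    ORDER BY d.month, p.category
    """
).fetchall()
print(report)
```

In a snowflake variant of this sketch, category would move out of dim_product into its own table, which adds one more join to the query. In a constellation variant, a second fact table (for example a hypothetical fact_returns) would reuse the same dim_date and dim_product.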


The design consists of the following steps:

Select the business process to be analyzed. A business process can be a single business event, such as the payment or refund of a transaction; the state of something, such as the current account balance; or a business flow made up of a series of related events. Which one to pick depends on whether we want to look at the occurrence of certain events, the current state, or the efficiency with which events flow.
Choose the granularity. We need to anticipate how finely every analysis will need to slice the data in order to settle on a granularity. Granularity is a combination of dimensions.
Identify the dimension tables. Once the granularity is chosen, design the dimension tables on that basis, including the dimension attributes used for grouping and filtering in analysis.
Choose the facts. Determine the measures that the analysis needs.
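
The four steps above can be sketched in plain Python. Everything here (the order-payment process, the column names, the sample rows, and the two helper functions) is a hypothetical illustration. It shows one practical consequence of "granularity is a combination of dimensions": a fact table should contain at most one row per combination of its grain columns.

```python
from collections import Counter

# Step 1: business process = order payment (hypothetical).
# Step 2: declared grain = one row per order per product.
GRAIN = ("order_id", "product_id")
# Step 3: dimension attributes are order_id, product_id and pay_date.
# Step 4: the fact (numeric measure) is pay_amount.

fact_rows = [
    {"order_id": 1, "product_id": "p1", "pay_date": "2022-06-01", "pay_amount": 10.0},
    {"order_id": 1, "product_id": "p2", "pay_date": "2022-06-01", "pay_amount": 5.0},
    {"order_id": 2, "product_id": "p1", "pay_date": "2022-06-02", "pay_amount": 10.0},
]


def grain_violations(rows, grain):
    """Return the grain-key combinations that appear more than once."""
    counts = Counter(tuple(row[col] for col in grain) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def total_by(rows, dimension, fact):
    """Sum one fact grouped by one dimension attribute."""
    totals = {}
    for row in rows:
        totals[row[dimension]] = totals.get(row[dimension], 0.0) + row[fact]
    return totals


violations = grain_violations(fact_rows, GRAIN)
daily = total_by(fact_rows, "pay_date", "pay_amount")
print(violations, daily)
```

A check like grain_violations is cheap to run after each load; a non-empty result usually means the grain was declared too coarsely or that rows were loaded twice.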

2.3 Data Vault Model

The Data Vault model is derived from the ER model. Its design goal is likewise data integration, and it cannot be used directly for data analysis and decision making either. It emphasizes building an auditable base data layer, that is, it stresses the historical nature, traceability, and atomicity of data, and does not require heavy consistency processing and integration. At the same time, it organizes enterprise data structurally around subjects and introduces further normalization to optimize the model, so that it stays extensible when source systems change. The Data Vault model consists of the following parts:

Hub: a core business entity of the enterprise. It consists of the entity key, a data warehouse sequence surrogate key, the load time, and the data source.
Link: represents a relationship between Hubs. The biggest difference from the ER model is that the relationship is abstracted into an independent unit, which improves the model's extensibility. It can directly describe 1:1, 1:n, and n:n relationships without any change. It consists of the Hubs' surrogate keys, the load time, and the data source.
Satellite: the detailed description of a Hub. One Hub can have multiple Satellites. It consists of the Hub's surrogate key, the load time, the source type, and the detailed descriptive information of the Hub.

The Data Vault model is easier to design and produce than the ER model, and its ETL processing can be driven by configuration. The core idea of Data Vault can be pictured this way: if the Hubs are the skeleton of an adult, then the Links are the ligaments that connect the bones, and the Satellites are the flesh and blood on the skeleton.
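
As a rough sketch of how Hub, Link, and Satellite fit together, the example below builds a tiny customer/order vault with Python's built-in sqlite3 module. All table and column names (hub_customer, link_customer_order, sat_customer, record_source, and so on) are assumptions made for illustration, not a prescribed Data Vault naming standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hub: business key + surrogate key + load time + record source.
    CREATE TABLE hub_customer (
        customer_sk INTEGER PRIMARY KEY, customer_id TEXT UNIQUE,
        load_time TEXT, record_source TEXT);
    CREATE TABLE hub_order (
        order_sk INTEGER PRIMARY KEY, order_id TEXT UNIQUE,
        load_time TEXT, record_source TEXT);

    -- Link: a relationship between hubs, stored as its own table.
    CREATE TABLE link_customer_order (
        customer_sk INTEGER REFERENCES hub_customer(customer_sk),
        order_sk INTEGER REFERENCES hub_order(order_sk),
        load_time TEXT, record_source TEXT);

    -- Satellite: descriptive attributes of a hub, versioned by load time.
    CREATE TABLE sat_customer (
        customer_sk INTEGER REFERENCES hub_customer(customer_sk),
        load_time TEXT, record_source TEXT, city TEXT);

    INSERT INTO hub_customer VALUES (1, 'C001', '2022-06-01', 'crm');
    INSERT INTO hub_order VALUES (1, 'O900', '2022-06-02', 'shop');
    INSERT INTO link_customer_order VALUES (1, 1, '2022-06-02', 'shop');
    INSERT INTO sat_customer VALUES
        (1, '2022-06-01', 'crm', 'Hangzhou'),
        (1, '2022-06-20', 'crm', 'Shanghai');
    """
)

# History is kept: a changed attribute becomes a new satellite row.
history = conn.execute(
    "SELECT load_time, city FROM sat_customer"
    " WHERE customer_sk = 1 ORDER BY load_time"
).fetchall()

# Current view: the latest satellite row per hub key, joined through the link.
current = conn.execute(
    """
    SELECT hc.customer_id, ho.order_id, s.city
    FROM link_customer_order AS l
    JOIN hub_customer AS hc ON hc.customer_sk = l.customer_sk
    JOIN hub_order AS ho ON ho.order_sk = l.order_sk
    JOIN sat_customer AS s ON s.customer_sk = hc.customer_sk
    WHERE s.load_time = (
        SELECT MAX(s2.load_time) FROM sat_customer AS s2
        WHERE s2.customer_sk = hc.customer_sk)
    """
).fetchall()
print(history, current)
```

Because the relationship lives in its own Link table, changing it from 1:n to n:n needs no schema change, only more rows. The number of joins needed for even a simple "current" view also suggests why this model is not meant to be queried directly for analysis.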

3. The idea of data warehouse layering

3.1 Theoretical basis of data warehouse layering

The data model is divided into several layers. Different companies split the layers differently according to their business needs, but most use the following ones.


ODS layer (Operational Data Store): stores the data obtained directly from the source systems; its data structures and the logical relationships between data stay basically consistent with the source systems. This layer also handles the data-warehouse technical processing of some business system fields, performs a small amount of basic data cleaning (for example dirty-data filtering, character set conversion, and dimension value handling), and generates incremental data tables.
DIM layer (Dimension): mainly stores simple, static, code-type dimension tables, including dimension tables extracted and transformed from the OLTP layer, dimension tables built for business or analysis requirements, and warehouse technical dimension tables such as the date dimension table.
DWD layer (Data Warehouse Detail): based on the division into subject domains, this layer is oriented to business subjects and uses a data-driven model design to complete data integration and provide a unified base data source. Data cleaning, redefinition, integration, and classification are completed at this layer.
DWM layer (Data Warehouse Model): oriented to analysis subjects and unified data access. The basic indicator libraries and multidimensional models for all basic data, business rules, and business entities are computed and modeled uniformly here, so a large number of basic indicator libraries and multidimensional models are implemented at this layer. Model design at this layer is driven by analysis requirements and performs association or light-summary calculations on data across business subject domains, so it contains multi-table join and summary computations over large data volumes.
DM layer (Data Mart): mainly builds wide tables with multidimensional redundancy (to handle complex queries) and summary tables for multi-angle analysis.
APP layer (Application): provides differentiated data services to meet the needs of business teams. Reports (Tableau, email reports), self-service data retrieval, and similar outputs are implemented at this layer.
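
The flow of data through these layers can be sketched end to end with Python's built-in sqlite3 module. The table names follow the layer prefixes above, but the tables themselves, their columns, and the cleaning rule (dropping rows with a missing amount) are hypothetical; a production warehouse would typically run comparable SQL on Hive or Spark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- ODS: data as it arrives from the source system, dirty rows included
    -- (order 3 has no amount).
    CREATE TABLE ods_order (order_id INTEGER, city_code TEXT, amount TEXT, dt TEXT);
    INSERT INTO ods_order VALUES
        (1, 'hz', '10.5', '2022-06-01'),
        (2, 'sh', '20.0', '2022-06-01'),
        (3, 'hz', NULL, '2022-06-01'),
        (4, 'hz', '4.5', '2022-06-02');

    -- DIM: a small code-to-name dimension table.
    CREATE TABLE dim_city (city_code TEXT PRIMARY KEY, city_name TEXT);
    INSERT INTO dim_city VALUES ('hz', 'Hangzhou'), ('sh', 'Shanghai');

    -- DWD: cleaned, typed detail data.
    CREATE TABLE dwd_order AS
    SELECT order_id, city_code, CAST(amount AS REAL) AS amount, dt
    FROM ods_order
    WHERE amount IS NOT NULL;

    -- DWM: light summary per day and city.
    CREATE TABLE dwm_order_daily AS
    SELECT dt, city_code, COUNT(*) AS order_cnt, SUM(amount) AS total_amount
    FROM dwd_order
    GROUP BY dt, city_code;

    -- DM: a wide table with the dimension attributes joined in.
    CREATE TABLE dm_order_daily_wide AS
    SELECT m.dt, c.city_name, m.order_cnt, m.total_amount
    FROM dwm_order_daily AS m
    JOIN dim_city AS c ON c.city_code = m.city_code;
    """
)

# APP: the query a report would run against the wide table.
report = conn.execute(
    "SELECT dt, city_name, order_cnt, total_amount FROM dm_order_daily_wide"
    " ORDER BY dt, city_name"
).fetchall()
print(report)
```

Each statement reads only from the layer directly beneath it, so a fix to the cleaning rule in dwd_order propagates upward by rerunning the later steps without touching ods_order.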

3.2 Why layering

Many people know that a data warehouse needs layering but may not be clear on why. Here are the benefits of layering:

Clear data structure: every layer has its own scope, which makes tables easier to locate and understand when we use them.
Easier data lineage tracking: put simply, what we finally deliver to the business is a table that can be used directly, but it is built from many sources. If one of the source tables has a problem, we want to locate the problem quickly and accurately and understand the scope of its impact.
Less duplicated development: with standardized layering, common middle-layer data can be developed once, which avoids a great deal of repeated computation.
Simplified complex problems: a complex task is broken into multiple steps, with each layer handling a single step, which is simpler and easier to understand. It also makes data accuracy easier to maintain: when data goes wrong, you do not have to repair everything, only start from the step where the problem occurred.
Shielding from raw-data anomalies and business changes: downstream layers are insulated from the impact of business changes, so data does not have to be re-ingested every time the business changes.
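
The lineage benefit can be illustrated with a short sketch in plain Python. The dependency map below is hypothetical; in practice lineage is usually extracted from the scheduler or from the SQL itself, but the idea of walking downstream from a broken table is the same.

```python
from collections import deque

# Hypothetical lineage: each table maps to the tables built directly from it.
DOWNSTREAM = {
    "ods_order": ["dwd_order"],
    "ods_user": ["dwd_user"],
    "dwd_order": ["dwm_order_daily"],
    "dwd_user": ["dwm_order_daily"],
    "dwm_order_daily": ["dm_order_wide"],
    "dm_order_wide": ["app_sales_report"],
}


def affected_tables(source, downstream=DOWNSTREAM):
    """Breadth-first walk returning every table derived from `source`."""
    seen, order = {source}, []
    queue = deque([source])
    while queue:
        table = queue.popleft()
        for child in downstream.get(table, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impact = affected_tables("ods_user")
print(impact)
```

If ods_user turns out to be wrong, the returned list is exactly the set of tables to rebuild, in an order that respects the layering.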

4. Data warehouse development technology

I will only briefly cover this part. Broadly speaking, data warehouse development is divided into offline and real-time development.

Offline development components: Hadoop, Spark, Hive, YARN, HDFS, data lakes, etc.
Real-time development components: Flink, Spark Streaming, Kafka, etc.

In general, data warehouse development requires some knowledge of every component, and a deeper understanding of the internal mechanisms of Spark and Flink. The most commonly used language is SQL, and you can also develop with PySpark. Occasionally you need to write custom functions, and you need to be able to handle data optimization and governance issues so as to provide better data support for business colleagues.