Introduction to data warehouse
2022-06-28 00:11:00 【Willow shallot】
Recently, many classmates have asked me what my internship involves and whether there is a learning route they can follow. Every time I answer that I do data warehouse development, but many of them know little about data warehouses, so I wrote this blog post to introduce them. If you are interested in earlier posts, see the following articles:
- Link: Willow shallot's Hadoop road
- Link: Willow shallot's Spark road
- Link: Willow shallot's Flink road
- Link: Liu XiaoCong's big data interview road
This article takes a beginner's perspective and introduces what a data warehouse is and the theory behind it. I will not cover anything related to my company's business. You may feel a little confused after reading it, which is normal: theory only sticks with practice, and you will remember it far better once you have tried it yourself.
1. The origin of the data warehouse
For an enterprise, most data lives in online business systems such as the production management system, the sales system and the inventory management system. As the business runs, data is generated continuously and is usually stored in business databases such as MySQL and Oracle to support the operation of those systems.
As time goes by, the business systems accumulate more and more data. When that data needs to be analyzed, queries have to join many tables, and some tables are accessed very often while others are rarely touched, so a database designed for transactions cannot meet the needs of data analysis. The books I often read on this topic are listed in the references at the end of this article.
1.1 Database and data warehouse
Here I want to talk about the difference between a database and a data warehouse. Database design generally uses the ER (entity-relationship) model and is strictly required to satisfy 3NF in order to minimize data redundancy, while a data warehouse does not have to satisfy 3NF strictly. Some of you may have forgotten what 3NF is, so here is a reminder:
- 1NF: every column of a table holds an indivisible, atomic data item.
- 2NF: every non-key column depends on the whole primary key, not on just a part of it (eliminates partial dependencies).
- 3NF: every non-key column depends on the primary key directly, not through another non-key column (eliminates transitive dependencies).
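To make the 3NF rule more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables and columns (`orders_flat`, `customers`, `orders`) are hypothetical examples of mine, not from any real system. The flat table stores the customer's city on every order, which is a transitive dependency; the decomposed tables store it once per customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Violates 3NF: customer_city depends on customer_id, not directly on order_id.
CREATE TABLE orders_flat (
    order_id      INTEGER PRIMARY KEY,
    customer_id   INTEGER,
    customer_city TEXT,
    amount        REAL
);
INSERT INTO orders_flat VALUES
    (1, 100, 'Beijing', 20.0),
    (2, 100, 'Beijing', 35.0),
    (3, 200, 'Shanghai', 12.5);

-- 3NF decomposition: the city is stored once per customer.
CREATE TABLE customers (
    customer_id   INTEGER PRIMARY KEY,
    customer_city TEXT
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    amount      REAL
);
INSERT INTO customers SELECT DISTINCT customer_id, customer_city FROM orders_flat;
INSERT INTO orders SELECT order_id, customer_id, amount FROM orders_flat;
""")

# Joining the two normalized tables reproduces the original flat rows.
rebuilt = conn.execute("""
    SELECT o.order_id, o.customer_id, c.customer_city, o.amount
    FROM orders o JOIN customers c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()
original = conn.execute("SELECT * FROM orders_flat ORDER BY order_id").fetchall()
customer_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

The join brings back the same rows, so no information is lost, while the redundant city values are gone. A data warehouse often accepts the flat form on purpose, because it saves joins at query time.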
Next, we need to introduce two concepts: OLTP and OLAP. They represent two ways of processing data, which correspond to storing data in a database and in a data warehouse respectively.
- Online transaction processing, OLTP (On-Line Transaction Processing): the day-to-day operation of an online database, usually querying and updating one record or a small group of records. It mainly serves specific applications of the enterprise. In a transactional environment, people care about response time and about the security and integrity of the data.
- Online analytical processing, OLAP (On-Line Analytical Processing): used for decision analysis by managers. The data behind decisions is mostly historical, aggregated or computed, so it is often necessary to access large amounts of historical data and to support complex queries and multidimensional analysis. Analytical processing also often uses external data that is not produced by the transaction processing systems but comes from other data sources.
The differences between OLTP and OLAP:
Item | OLTP | OLAP |
---|---|---|
Characteristic | Operational processing | Informational processing |
Users | Clerks, DBAs, database professionals | Data analysts, project managers, product managers |
Function | Day-to-day operations | Long-term informational requirements, decision support |
DB design | ER-based, application-oriented | Dimensional modeling, subject-oriented |
Data | Current; guaranteed up to date | Historical; maintained over time |
Summarization | Primitive, highly detailed | Summarized, consolidated |
View | Detailed, flat relational | Summarized, multidimensional |
Unit of work | Short, simple transactions | Complex queries |
Access | Read/write | Mostly read |
Focus | Data in | Information out |
Operations | Index/hash on primary key | Lots of scans |
Records accessed | Tens | Millions |
Number of users | Thousands | Hundreds |
DB size | 100 MB to GB | 100 GB to TB |
Priority | High performance, high availability | High flexibility, end-user autonomy |
Metric | Transaction throughput | Query throughput, response time |
- T is for transaction: small amount of data per query, high concurrency, highly random access, millisecond latency, ACID.
- Typical systems: MySQL, PostgreSQL, SQL Server, Oracle
- A is for analysis: large amounts of data, low concurrency, batch processing, latency from seconds to minutes, no ACID requirement, aiming for "eventual consistency".
- Typical systems: Hive, HBase, Spark, ClickHouse
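The two access patterns can be put side by side in a small sketch. It uses sqlite3 only as a convenient stand-in; the `sales` table and its data are made up for illustration, and a real OLAP engine would of course run on far more data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
rows = [(i, "north" if i % 2 else "south", 2020 + i % 3, float(i)) for i in range(1, 1001)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# OLTP-style: touch one record by primary key inside a short transaction.
with conn:
    conn.execute("UPDATE sales SET amount = amount + 1 WHERE id = ?", (42,))
one_row = conn.execute("SELECT amount FROM sales WHERE id = ?", (42,)).fetchone()

# OLAP-style: scan every row and aggregate along two dimensions (region, year).
summary = conn.execute("""
    SELECT region, year, COUNT(*) AS n, SUM(amount) AS total
    FROM sales
    GROUP BY region, year
    ORDER BY region, year
""").fetchall()
```

The first query reads and writes a single row through the primary key index. The second one has to read the whole table, which is why analytical workloads care more about scan throughput than about per-row latency.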
2. The idea of data warehouse modeling
2.1 ER Model
The ER model, also called the entity-relationship model (Entity Relationship, ER), describes the enterprise's business with an entity-relationship-entity pattern and conforms to 3NF in terms of normal form theory. The 3NF of a data warehouse differs from the 3NF of an OLTP system: it is a subject-oriented abstraction from the perspective of the whole enterprise, rather than an abstraction of entity relationships for one specific business process. It has the following characteristics:
- It requires a comprehensive understanding of the enterprise's business and data (and understanding everything is hard).
- The implementation cycle is very long (and the model is not well suited to modification).
- It demands a great deal of skill from the modeler.
The starting point for building a data warehouse model with ER is data integration: the data of each system is combined and merged by subject from the perspective of the whole enterprise and made consistent, in service of data analysis and decision-making. The resulting model, however, cannot be used directly for analysis and decisions.
The modeling process has three stages:
- High-level model: a highly abstract model that describes the main subjects and the relationships between them, giving an overall picture of the enterprise's business.
- Mid-level model: refines the data items of each subject on the basis of the high-level model.
- Physical model (also called the low-level model): builds on the mid-level model and takes physical storage into account, designing physical properties according to performance and platform characteristics. Tables may also be merged or partitioned at this stage.
Summary: ER modeling resembles the system design process of an information management system. It requires a comprehensive understanding of the business and has a long implementation cycle, so it is not suitable for enterprises that are developing quickly and whose business changes frequently.
2.2 Dimensional model
Dimensional modeling is the most popular data warehouse modeling method in data warehouse engineering. It builds the model starting from the needs of analysis and decision-making and serves those needs, so it focuses on letting users complete their analysis faster while responding well to large, complex queries. The typical representative is the star model, with the snowflake model used in some special scenarios. Here are the star, snowflake and constellation models of dimensional modeling:
- Star model: consists of one fact table and several denormalized dimension tables. It can be implemented in a relational database. The fact table is the core of the model, and the dimension tables surround it.
- Snowflake model: an extension of the star model in which each dimension table can be further linked to several more detailed category tables.
- Constellation model: also an extension of the star model. The difference is that it contains several fact tables that share dimension tables, and it is often used when the data relationships are more complex. It is also called the galaxy model.
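A tiny star model can be sketched as follows, again with sqlite3 as a stand-in for a real warehouse engine. The table names (`fact_sales`, `dim_date`, `dim_product`) and the sample rows are assumptions of mine for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: denormalized descriptive attributes.
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);

-- Fact table: one foreign key per dimension plus numeric measures.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    amount      REAL
);

INSERT INTO dim_date VALUES (20220601, 2022, 6), (20220701, 2022, 7);
INSERT INTO dim_product VALUES
    (1, 'pen', 'stationery'), (2, 'ink', 'stationery'), (3, 'mug', 'kitchen');
INSERT INTO fact_sales VALUES
    (20220601, 1, 10, 15.0),
    (20220601, 3,  2, 18.0),
    (20220701, 2,  5, 25.0),
    (20220701, 3,  1,  9.0);
""")

# Typical star query: join the fact table to its dimensions, group by dimension attributes.
sales_by_category = conn.execute("""
    SELECT p.category, d.month, SUM(f.amount) AS total_amount
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d    ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
```

In a snowflake variant of this sketch, `category` would move out of `dim_product` into its own table that `dim_product` references. In a constellation variant, a second fact table (say, refunds) would reuse the same two dimension tables.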
The design is divided into the following steps:
- Select the business process to be analyzed. A business process can be a single business event, such as the payment or refund of a transaction; the state of something, such as the current account balance; or a flow made up of a series of related business events. It depends on whether we want to look at the occurrence of certain events, the current state, or the efficiency with which events move through the flow.
- Choose the granularity. In event analysis, we need to anticipate how finely every analysis will need to be broken down, and choose the granularity accordingly. Granularity is a combination of dimensions.
- Identify the dimension tables. After choosing the granularity, design the dimension tables based on it, including the dimension attributes used for grouping and filtering in analysis.
- Choose the facts. Determine the metrics that the analysis needs to measure.
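The four steps above can be written down as a small design record and then applied to raw events. This is only a sketch under my own assumptions: `FactDesign`, `build_fact_rows` and the payment events are hypothetical names and data, not part of any modeling tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FactDesign:
    business_process: str  # step 1: what is being analyzed
    grain: tuple           # step 2: dimension columns that identify one fact row
    dimensions: tuple      # step 3: dimensions used for grouping and filtering
    facts: tuple           # step 4: numeric measures


design = FactDesign(
    business_process="order payment",
    grain=("pay_date", "shop_id"),
    dimensions=("pay_date", "shop_id"),
    facts=("pay_amount",),
)

# Raw payment events, finer than the chosen grain (made-up sample data).
events = [
    {"pay_date": "2022-06-01", "shop_id": "s1", "pay_amount": 10.0},
    {"pay_date": "2022-06-01", "shop_id": "s1", "pay_amount": 5.0},
    {"pay_date": "2022-06-01", "shop_id": "s2", "pay_amount": 7.0},
    {"pay_date": "2022-06-02", "shop_id": "s1", "pay_amount": 3.0},
]


def build_fact_rows(design, events):
    """Aggregate events so each combination of grain columns yields exactly one row."""
    totals = defaultdict(lambda: {f: 0.0 for f in design.facts})
    for event in events:
        key = tuple(event[col] for col in design.grain)
        for fact in design.facts:
            totals[key][fact] += event[fact]
    return [dict(zip(design.grain, key), **measures) for key, measures in sorted(totals.items())]


fact_rows = build_fact_rows(design, events)
```

Because the grain is (pay date, shop), the two payments of shop `s1` on the same day collapse into one row. If analysts later need per-order detail, the grain was chosen too coarsely, which is why step 2 asks you to anticipate every analysis up front.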
2.3 Data Vault Model
The Data Vault model is derived from the ER model. Its design also starts from data integration, and it likewise cannot be used directly for data analysis and decision-making. It emphasizes building an auditable base data layer, that is, the historical nature, traceability and atomicity of the data, and it does not require excessive consistency processing or integration. At the same time it organizes enterprise data structurally around subjects and introduces further normalization to optimize the model, so that it can scale with changes in the source systems. The Data Vault model consists of the following parts:
- Hub: a core business entity of the enterprise, made up of the entity key, a data warehouse sequence surrogate key, the load time and the data source.
- Link: represents the relationship between Hubs. The biggest difference from the ER model is that the relationship is abstracted into an independent unit, which improves the extensibility of the model. It can describe 1:1, 1:n and n:n relationships directly without any change. It is made up of the surrogate keys of the Hubs, the load time and the data source.
- Satellite: the detailed description of a Hub. One Hub can have multiple Satellites. It is made up of the surrogate key of the Hub, the load time, the source type and the detailed descriptive information of the Hub.
The Data Vault model is easier to design and build than the ER model, and its ETL processing can be made configurable. The core idea of Data Vault: if the Hub is the skeleton of an adult, then the Link is the ligament that connects the bones, and the Satellite is the flesh and blood on the skeleton.
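The three building blocks can be sketched as plain Python records. The field names follow the descriptions above; everything else (the class layout, the `_seq` counter standing in for a warehouse sequence, the sample customer and order) is an assumption of mine for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import count

_seq = count(1)  # stand-in for the warehouse sequence that issues surrogate keys


@dataclass(frozen=True)
class Hub:
    surrogate_key: int
    business_key: str    # the entity key from the source system
    load_time: datetime
    record_source: str


@dataclass(frozen=True)
class Link:
    hub_keys: tuple      # surrogate keys of the Hubs being related
    load_time: datetime
    record_source: str


@dataclass(frozen=True)
class Satellite:
    hub_key: int         # surrogate key of the Hub being described
    load_time: datetime
    record_source: str
    attributes: dict     # descriptive details of the Hub


customer = Hub(next(_seq), "CUST-001", datetime(2022, 6, 1), "crm")
order = Hub(next(_seq), "ORD-9001", datetime(2022, 6, 2), "order_system")
placed = Link((customer.surrogate_key, order.surrogate_key), datetime(2022, 6, 2), "order_system")

# History is kept by appending Satellite rows, never by updating old ones.
customer_history = [
    Satellite(customer.surrogate_key, datetime(2022, 6, 1), "crm", {"city": "Beijing"}),
    Satellite(customer.surrogate_key, datetime(2022, 6, 28), "crm", {"city": "Shanghai"}),
]
latest_city = max(customer_history, key=lambda s: s.load_time).attributes["city"]
```

Because the relationship lives in its own `Link` record, going from one order per customer to many orders per customer needs no change to either Hub, only more Link rows.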
3. The idea of data warehouse layering
3.1 Theoretical basis of data warehouse layering
The data model is divided into several layers. Different companies define different layers according to their business needs, but most use the following:
- ODS layer (Operational Data Store): stores the data obtained directly from the source systems, with data structures and logical relationships basically consistent with the source. It handles the warehouse-side technical processing of some business system fields, a small amount of basic data cleaning (for example dirty data filtering, character set conversion and dimension value handling) and the generation of incremental data tables.
- DIM layer (Dimension): mainly stores simple, static, code-type dimension tables, including dimension tables extracted and transformed from the OLTP layer, dimension tables built for business or analysis needs, and technical dimension tables of the warehouse such as the date dimension.
- DWD layer (Data Warehouse Detail): designs the model by subject domain, oriented to business subjects and driven by the data, completes data integration and provides a unified base data source. Data cleaning, redefinition, integration and classification are completed at this layer.
- DWM layer (Data Warehouse Middle): subject-oriented and analysis-oriented, with unified data access. The basic indicator library and the multidimensional models for all basic data, business rules and business entities are computed and modeled uniformly here, so a large number of basic indicators and multidimensional models are implemented in this layer. Its model design is driven by analysis requirements and it performs joins or light summaries across business subject domains, so it involves multi-table join-and-summarize computations over large amounts of data.
- DM layer (Data Mart): builds wide tables with multidimensional redundancy (to serve complex queries) and summary tables for multi-angle analysis.
- APP layer (Application): provides differentiated data services to meet the needs of the business side. Reports (Tableau, email reports), self-service data retrieval and so on are implemented at this layer.
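To show how data moves through the layers, here is a deliberately tiny sketch in plain Python. The function names (`to_dwd`, `to_dwm`, `to_app`) and the sample orders are hypothetical; a real warehouse would express each step as a Hive or Spark job over tables, and the DIM and DM layers are left out to keep it short.

```python
from collections import defaultdict

# ODS: rows kept as they arrive from the source system (made-up sample data).
ods_orders = [
    {"order_id": "1", "shop": " s1 ", "amount": "10.5", "dt": "2022-06-01"},
    {"order_id": "2", "shop": "s1", "amount": "4.5", "dt": "2022-06-01"},
    {"order_id": "3", "shop": "s2", "amount": "bad", "dt": "2022-06-01"},  # dirty record
    {"order_id": "4", "shop": "s2", "amount": "8.0", "dt": "2022-06-02"},
]


def to_dwd(rows):
    """DWD: clean and type the detail records, dropping rows that cannot be parsed."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        cleaned.append({
            "order_id": int(row["order_id"]),
            "shop": row["shop"].strip(),
            "amount": amount,
            "dt": row["dt"],
        })
    return cleaned


def to_dwm(rows):
    """DWM: light summary per (dt, shop), reusable by several downstream reports."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["dt"], row["shop"])] += row["amount"]
    return dict(totals)


def to_app(summary):
    """APP: report-ready rows, here the total sales per day."""
    daily = defaultdict(float)
    for (dt, _shop), amount in summary.items():
        daily[dt] += amount
    return sorted(daily.items())


dwd_orders = to_dwd(ods_orders)
dwm_shop_daily = to_dwm(dwd_orders)
app_daily_report = to_app(dwm_shop_daily)
```

Each function reads only the layer directly below it, so a fix in the cleaning rules means rerunning from `to_dwd` onward without touching the ODS data.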
3.2 Why layering is needed
Many of you know that a data warehouse needs layers but may not be clear about why. Here are the benefits of layering:
- Clear data structure: every data layer has its own scope, which makes tables easier to locate and understand when we use them.
- Easier data lineage tracking: put simply, what we finally present to the business is a table that can be used directly, but it is derived from many sources. If one of the source tables has a problem, we want to locate it quickly and accurately and understand the scope of its impact.
- Less duplicated development: with standardized layers and shared middle-layer data, a huge amount of repeated computation can be avoided.
- Simplifying complex problems: a complex task is broken into several steps and each layer handles a single one, which is simpler and easier to understand. It also makes data accuracy easier to maintain: when the data goes wrong, you do not have to repair everything, only rerun from the step where the problem occurred.
- Shielding against anomalies in the raw data and against business changes: data does not have to be re-ingested every time the business changes.
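The lineage point is easy to illustrate: once every table records which tables are built from it, finding the impact of a broken source table is a graph walk. The table names and the `downstream` mapping below are hypothetical, and real platforms usually collect this metadata automatically.

```python
from collections import deque

# Hypothetical lineage: each table maps to the tables built directly from it.
downstream = {
    "ods_orders": ["dwd_orders"],
    "ods_users": ["dwd_users"],
    "dwd_orders": ["dwm_shop_daily", "dwm_user_orders"],
    "dwd_users": ["dwm_user_orders"],
    "dwm_shop_daily": ["app_daily_report"],
    "dwm_user_orders": ["app_user_profile"],
}


def impacted_tables(table):
    """Breadth-first walk over the lineage graph; returns every table affected downstream."""
    seen = set()
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


affected = impacted_tables("ods_users")
```

Here a problem in `ods_users` reaches the user profile report but not the daily sales report, so only part of the warehouse has to be rerun.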
4. Data warehouse development technology
I will only introduce this part briefly. Data warehouse development is mainly divided into offline and real-time development.
- Offline development components: Hadoop, Spark, Hive, YARN, HDFS, data lakes, etc.
- Real-time development components: Flink, Spark Streaming, Kafka, etc.
In general, data warehouse development requires some knowledge of every component and a deeper understanding of the internals of Spark and Flink. SQL is what you use most; you can also develop with PySpark, and occasionally you need to write custom functions. You also need to be able to handle data optimization and governance problems, so that your business colleagues get better data guarantees.
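As a rough idea of what "writing a custom function" means, the sketch below registers a Python function so that it can be called from SQL. It uses sqlite3's `create_function` purely as a stand-in that runs anywhere; Hive and Spark have their own UDF registration mechanisms, which are not shown here. The `mask_phone` function and the `users` table are made up for illustration.

```python
import sqlite3


def mask_phone(phone):
    """Keep the first 3 and last 4 characters and mask the rest; pass short values through."""
    if phone is None or len(phone) < 7:
        return phone
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]


conn = sqlite3.connect(":memory:")
# Register the Python function under a SQL name, taking exactly one argument.
conn.create_function("mask_phone", 1, mask_phone)

conn.execute("CREATE TABLE users (name TEXT, phone TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("a", "13812345678"), ("b", None)])

masked = conn.execute("SELECT name, mask_phone(phone) FROM users ORDER BY name").fetchall()
```

The pattern is the same in the big data engines: write the logic once in a general-purpose language, register it, and then use it from SQL like any built-in function.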
5. References
- 《The Road to Big Data: Alibaba's Big Data Practice》
- 《The Data Warehouse Toolkit》
- Link: Why should a data warehouse be layered