
Data warehouse and database basics: understanding OLAP, OLTP, and HTAP, their similarities and differences, and the MPP architecture

2022-06-24 05:42:00 【Goose】

0. Background

When I first started learning about data warehousing, I was constantly confused by English abbreviations such as OLAP, MPP architecture, Kappa architecture, and ODS. This article sorts out these basic concepts.

First, the basics of data warehouse layering:

A warehouse is usually divided into three layers: ODS (raw data), DW (data warehouse layer), and ADS (application data layer).

ODS holds the rawest data.

The DW layer holds data after processing. It is usually divided into DWD and DWS. The DWD layer holds data extracted from the ODS layer after cleaning, and the DWS layer holds lightly summarized data.

Users can build the data required by the ADS layer directly on top of this layer. The ADS layer produces the final data required by applications.

0.1 Operational data layer (ODS)

ODS: Operational Data Store, the data staging area, also called the source-aligned layer. Tables from the warehouse's source systems are usually stored here unchanged. This is the ODS layer, the source for all subsequent warehouse processing.
Sources of ODS data:
- Business databases
  - Sqoop is often used for extraction, for example a scheduled pull once a day.
  - For real-time needs, consider using Canal to listen to the MySQL binlog and ingest changes as they happen.
- Event-tracking logs
  - Logs are generally stored as files; Flume can synchronize them on a schedule.
  - Spark Streaming or Flink can ingest them in real time.
  - Kafka also works.
- Message queues: data from ActiveMQ, Kafka, and so on.

0.2 Data warehouse layer (DW)

The DW layer is itself tiered. From bottom to top: DWD, DWB, DWS.

DWD: data warehouse detail, the detail data layer. It isolates the business layer from the data warehouse and mainly cleans and standardizes ODS data.
- Data cleaning: remove null values, dirty data, and values outside the allowed range.
DWB: data warehouse base, the base data layer. It stores objective data, generally serves as an intermediate layer, and can be regarded as the data layer behind a large number of metrics.
DWS: data warehouse service, the data service layer. Built on the base data in DWB, it integrates and summarizes data for a given subject area, usually as wide tables. It serves subsequent business queries, OLAP analysis, data distribution, and so on.
- User behavior, lightly aggregated.
- Mainly light summaries of ODS/DWD data.

0.3 Data service layer / application layer (ADS)

ADS: application data service. This layer mainly provides data for data products and data analysis. It is usually stored in systems such as ES and MySQL for use by online systems.
- The report data and wide tables people usually talk about are generally placed here.
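To make the layering concrete, here is a minimal sketch of one record set moving through ODS, DWD, DWS, and ADS. The field names, the sample rows, and the `max_ms` threshold are all assumptions made up for illustration; a real warehouse would do this with SQL jobs over tables rather than Python lists.

```python
from collections import defaultdict

# Hypothetical ODS rows: raw page-view logs copied from the source system as-is.
ods_rows = [
    {"user": "u1", "page": "home", "ms": 120},
    {"user": "u1", "page": "cart", "ms": None},   # null value
    {"user": "u2", "page": "home", "ms": -5},     # outside the allowed range
    {"user": "u2", "page": "home", "ms": 300},
    {"user": "u1", "page": "home", "ms": 80},
]

def build_dwd(rows, max_ms=60_000):
    """DWD: cleaning step that drops nulls and out-of-range values."""
    return [r for r in rows if r["ms"] is not None and 0 <= r["ms"] <= max_ms]

def build_dws(rows):
    """DWS: light aggregation per (user, page)."""
    agg = defaultdict(lambda: {"visits": 0, "total_ms": 0})
    for r in rows:
        key = (r["user"], r["page"])
        agg[key]["visits"] += 1
        agg[key]["total_ms"] += r["ms"]
    return dict(agg)

def build_ads(dws):
    """ADS: final per-page visit report consumed by an application."""
    report = defaultdict(int)
    for (_user, page), metrics in dws.items():
        report[page] += metrics["visits"]
    return dict(report)

dwd = build_dwd(ods_rows)
dws = build_dws(dwd)
ads = build_ads(dws)
```

Each function reads only the layer directly below it, which is the point of the layering: the ADS report can be rebuilt from DWS without touching the raw data again.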

1. OLTP (online transaction processing), OLAP (online analytical processing), and HTAP (hybrid transaction/analytical processing)

Data processing can be roughly divided into three categories: online transaction processing (OLTP), online analytical processing (OLAP), and the later-defined hybrid transaction/analytical processing (HTAP).

OLTP is the main application of traditional relational databases: basic, day-to-day transactions such as bank transactions.

OLAP is the main application of data warehouse systems. It supports complex analytical operations, focuses on decision support, and provides intuitive, easy-to-understand query results.

OLTP systems emphasize database memory efficiency, the hit rate of various memory metrics, bind variables, and concurrent operations.

OLAP systems emphasize data analysis, SQL execution time, disk I/O, partitioning, and so on.
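The contrast between the two workloads can be sketched with Python's built-in `sqlite3` module. The `account` table and its rows are hypothetical, and SQLite is used only because it ships with Python; it is not an OLAP engine. The point is the shape of the statements: a short transaction with bind variables on the OLTP side, a scan-and-aggregate query on the OLAP side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, region TEXT, balance INTEGER)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?, ?)",
    [(1, "north", 100), (2, "north", 50), (3, "south", 200)],
)
conn.commit()

def transfer(conn, src, dst, amount):
    """OLTP-style: a short transaction that touches a few rows by primary key."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))

def balance_by_region(conn):
    """OLAP-style: scan and aggregate many rows to answer a reporting question."""
    cur = conn.execute(
        "SELECT region, SUM(balance) FROM account GROUP BY region ORDER BY region"
    )
    return cur.fetchall()

transfer(conn, 1, 3, 30)
report = balance_by_region(conn)
```

The `?` placeholders are the bind variables mentioned above: the statement text stays constant, so the database can reuse its parsed form across many concurrent transactions.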

Gartner defined HTAP (hybrid transaction/analytical processing) in its report "Hybrid Transaction/Analytical Processing Will Foster Opportunities for Dramatic Business Innovation". Wikipedia describes HTAP as the ability of a single database to support both OLTP and OLAP for real-time intelligent processing.

To meet their various big data needs, users want a single database engine that can:

- Satisfy every data model and handle every workload (transactional, operational, BI, and analytical).
- Store all data on a single platform instead of on different platforms, each using a different database for a different workload.
- Reduce data movement and replication, along with the latency and operating costs they cause.
- Use operational, semi-structured, and unstructured data stored on the same platform for deep analysis and to mine its potential value.
- Generate reports and run real-time analysis while data is being ingested (with no delay), with seamless integration of and access to operational, historical, and big data.

2. SMP (symmetric multiprocessing), NUMA (non-uniform memory access), and MPP (massively parallel processing) compared

SMP

Symmetric multiprocessing means that the multiple CPUs in a server work symmetrically, with no primary/secondary or master/slave relationship. The defining trait of an SMP server is sharing: all system resources (CPU, memory, I/O, and so on) are shared. This same trait causes the main problem with SMP servers: very limited scalability.

NUMA

Non-uniform memory access. This architecture addresses SMP's limited scalability; with NUMA, dozens of CPUs can be combined in one server. A NUMA server has multiple CPU modules (nodes) that connect and exchange information through an interconnect module, so every CPU can access the memory of the whole system (an important difference from MPP systems). Access speed is not uniform, however: a CPU accesses local memory much faster than memory on other nodes, which is where the name non-uniform memory access comes from.

This architecture has its own drawback: because the latency of remote memory access far exceeds that of local access, system performance does not increase linearly as CPUs are added.

MPP

Massively parallel processing. MPP scales differently from NUMA: an MPP system consists of multiple SMP servers connected through a node interconnect network. They work together on the same task and look like a single server system to the user. Each node accesses only its own local resources, so this is a completely shared-nothing architecture.

MPP has the strongest scalability; in theory it can scale without limit. Because MPP connects multiple SMP servers and a CPU on one node cannot access the memory of another node, there is no remote memory access problem.

Since the CPUs in one node cannot access another node's memory, nodes exchange information over the node interconnect network. This process is called data redistribution.
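Data redistribution can be sketched as a hash shuffle: every row is sent to the node that owns its key, so that later work on that key needs no cross-node memory access. This is a minimal single-process sketch, assuming lists stand in for nodes and `zlib.crc32` stands in for whatever hash function a real MPP database uses; the `owner` and `redistribute` names are made up for this example.

```python
import zlib

def owner(key, num_nodes):
    """Pick the node that owns a key, using a hash that is stable across runs."""
    return zlib.crc32(str(key).encode()) % num_nodes

def redistribute(nodes, key_field):
    """Move every row to the node that owns its key (a network transfer in a real system)."""
    out = [[] for _ in nodes]
    for rows in nodes:
        for row in rows:
            out[owner(row[key_field], len(nodes))].append(row)
    return out

# Hypothetical layout: orders were loaded round-robin, so one user's rows are spread out.
nodes = [
    [{"user": "u1", "amt": 10}, {"user": "u2", "amt": 5}],
    [{"user": "u1", "amt": 7}, {"user": "u3", "amt": 1}],
    [{"user": "u2", "amt": 3}],
]
shuffled = redistribute(nodes, "user")
```

After the shuffle, all rows for a given user sit on exactly one node, so a per-user aggregate or join can run locally on each node.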

However, an MPP server needs a complex mechanism to schedule and balance load and parallel processing across nodes. Today, MPP-based servers often hide this complexity behind system-level software such as a database. For example, Teradata is relational database software based on MPP technology (it was the first database to adopt an MPP architecture). When developing applications on this database, developers face the same single database system no matter how many nodes make up the back-end server, and do not need to think about how to schedule load across those nodes.

MPP architecture features:

- Tasks execute in parallel
- Data is stored in a distributed fashion (localized)
- Distributed computing
- High concurrency: a single node can serve more than 300 concurrent users
- Horizontal scaling: cluster nodes can be added
- Shared-nothing architecture

3. Batch processing (MapReduce) vs. MPP

	Batch architecture (such as MapReduce)	MPP architecture
Advantages	If an Executor runs too slowly, it is given fewer tasks. The batch architecture has a speculative execution strategy: when it suspects that an Executor is slow or faulty, it assigns that Executor fewer tasks or none at all, so a problem on one node does not limit the performance of the whole cluster.	The MPP architecture does not need to write intermediate data to disk. Because a single Executor handles only a single task, it can simply stream data to the next stage of execution. This is called pipelining, and it gives a large performance boost.
Disadvantages	Its strength is also the source of its weakness: intermediate results are written to disk, which severely limits data processing performance.	Because tasks are bound to Executors, if one Executor is slow or faulty, the performance of the whole cluster is limited by the speed of that node. The biggest flaw of the MPP architecture is therefore the weakest-link ("short board") effect. In addition, the more nodes a cluster has, the higher the probability that some node has a problem, and any such node limits the whole cluster. In production, an MPP cluster therefore should not have too many nodes.

Similarities:

Both the batch architecture and the MPP architecture are forms of distributed parallel processing: they distribute tasks to multiple servers and nodes in parallel, compute on each node, and then merge the partial results into the final result.
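The shared pattern (distribute, compute locally, merge) can be sketched as a scatter/gather average. This is an illustration only: threads stand in for nodes, and the `partitions` data and function names are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def local_sum_count(partition):
    """Runs on one node: aggregate only the data that node holds."""
    return sum(partition), len(partition)

def distributed_mean(partitions):
    """Scatter the work to every node, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(local_sum_count, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Hypothetical data, already split across three nodes.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
mean = distributed_mean(partitions)
```

Each node returns a (sum, count) pair rather than its own average, because averages of unequal partitions cannot simply be averaged again; the merge step needs results that combine correctly.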

Differences:

The two architectures differ in how they place work. When a job runs, it is first split into multiple tasks. In MapReduce, these tasks are assigned arbitrarily to whichever Executors are idle; in an MPP engine, every data-processing task is bound to a specified Executor.
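A small simulation shows why the binding matters when one node is slow. It is a sketch under simple assumptions: every task costs one unit of work, a node with speed `s` finishes a task in `1/s` time units, and the cluster speeds are hypothetical. It models dynamic task assignment only, not speculative re-execution.

```python
import heapq

def static_makespan(num_tasks, speeds):
    """MPP-style: tasks are split evenly and bound to their node up front."""
    per_node = num_tasks / len(speeds)
    return max(per_node / s for s in speeds)

def dynamic_makespan(num_tasks, speeds):
    """Batch-style: whichever node is idle pulls the next task from a shared queue."""
    free_at = [(0.0, i) for i in range(len(speeds))]  # (time node becomes idle, node id)
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_tasks):
        t, i = heapq.heappop(free_at)      # earliest idle node takes the task
        done = t + 1.0 / speeds[i]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish

speeds = [1.0, 1.0, 1.0, 0.25]   # hypothetical cluster: the last node is 4x slower
static_t = static_makespan(120, speeds)
dynamic_t = dynamic_makespan(120, speeds)
```

With static binding, the job finishes only when the slow node has worked through its full share, while dynamic assignment lets the fast nodes absorb most of the work. The sketch ignores the cost that the batch side pays for this flexibility, namely writing intermediate data to disk.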

As an example of how often each architecture writes data to disk, consider a join of two large tables:

For batch processing, Spark for example writes to disk three times (first write: table 1 is shuffled by join key; second write: table 2 is shuffled by join key; third write: the hash table is written to disk).
MPP needs only one write (the hash table). This is because MPP runs the mapper and the reducer at the same time, whereas MapReduce splits them into dependent tasks (a DAG) that execute asynchronously, so the data dependency has to be resolved by writing intermediate data to shared storage.
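The pipelined style can be sketched with Python generators: the build side is loaded into an in-memory hash table once, and the probe side streams through it row by row without ever being written out. The table contents and the `scan` helper are hypothetical, and a real engine would spill the hash table to disk when it does not fit in memory.

```python
def scan(rows):
    """Stand-in for an upstream operator that produces rows one at a time."""
    for row in rows:
        yield row

def hash_join(build_rows, probe_rows, key):
    """Build a hash table on one input, then stream the other input through it."""
    table = {}
    for row in build_rows:
        table.setdefault(row[key], []).append(row)
    for row in probe_rows:                 # the probe side is never materialized
        for match in table.get(row[key], []):
            yield {**match, **row}

users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
orders = [
    {"uid": 1, "amt": 10},
    {"uid": 2, "amt": 5},
    {"uid": 1, "amt": 7},
    {"uid": 3, "amt": 1},   # no matching user, so it produces no output row
]

joined = list(hash_join(scan(users), scan(orders), "uid"))
```

Because `hash_join` is itself a generator, a downstream operator could consume joined rows as they are produced, which is the pipelining described in the comparison table.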

4. MPP architecture OLAP engines

4.1 Compute only, no storage

Impala

Apache Impala is a query engine with an MPP architecture. It stores no data itself and computes directly in memory. It covers data warehouse use cases and offers real-time, batch, and high-concurrency capabilities.

It provides SQL-like (Hive SQL-like) syntax and keeps response times short and throughput high even in multi-user scenarios. It is implemented in Java and C++: Java provides the query interaction interface and its implementation, and C++ implements the query engine.

Impala can share the Hive Metastore, but instead of the slow Hive + MapReduce batch path it uses a distributed query engine similar to those in commercial parallel relational databases (made up of the Query Planner, Query Coordinator, and Query Exec Engine). It can query data directly from HDFS or HBase with SELECT, JOIN, and aggregate functions, which greatly reduces latency.

Impala is often paired with the Kudu storage engine. The biggest advantages of this combination are faster queries and support for data Update and Delete.

Presto

Presto is a distributed query engine with an MPP architecture. It stores no data itself, but it can access multiple data sources and supports cascading queries across them. Presto is an OLAP tool that excels at complex analysis of massive data. OLTP workloads are not what Presto is good at, so do not use Presto as a database.

Presto is a low-latency, high-concurrency in-memory computing engine. It has to fetch the data it analyzes from other data sources, and it can connect to many of them, including Hive, RDBMSs (MySQL, Oracle, TiDB, and so on), Kafka, MongoDB, and Redis.

4.2 Both compute and storage

1. ClickHouse

ClickHouse is an open source columnar database that has attracted much attention in recent years. It is used mainly for data analysis (OLAP).

It has its own storage and compute, is fully self-contained and highly available, and supports full SQL syntax including JOIN, which gives it clear technical advantages. Compared with the Hadoop ecosystem, handling big data the database way is simpler to use, cheaper to learn, and more flexible. The community is still developing rapidly and is very active in China, where large companies have adopted it at scale.

ClickHouse has done very detailed work in the compute layer to squeeze as much as possible out of the hardware and speed up queries. It implements single-machine multi-core parallelism, distributed computing, vectorized execution with SIMD instructions, code generation, and other important techniques.

Starting from the needs of OLAP scenarios, ClickHouse built a new, efficient columnar storage engine with ordered data storage, primary key indexes, sparse indexes, data sharding, data partitioning, TTL, primary/replica replication, and other rich features. Together these lay the foundation for ClickHouse's extremely fast analytical performance.
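The combination of ordered storage and a sparse index can be illustrated with a toy example. This is a sketch of the general idea only, not ClickHouse's actual implementation: the granule size of 4, the key values, and the function names are assumptions chosen to keep the example readable.

```python
from bisect import bisect_left, bisect_right

GRANULE = 4  # hypothetical granule size; real systems use thousands of rows per granule

def build_sparse_index(sorted_keys, granule=GRANULE):
    """Keep only the first key of every granule instead of one entry per row."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule)]

def lookup(sorted_keys, index, key, granule=GRANULE):
    """Use the sparse index to pick candidate granules, then scan only those rows."""
    first = max(bisect_left(index, key) - 1, 0)   # granule before the first mark >= key
    last = bisect_right(index, key)               # exclusive: first mark > key
    lo, hi = first * granule, min(last * granule, len(sorted_keys))
    hits = [i for i in range(lo, hi) if sorted_keys[i] == key]
    return hits, hi - lo                          # matching positions, rows scanned

# Rows are stored sorted by primary key, which is what makes the sparse index usable.
keys = [1, 3, 3, 5, 7, 8, 8, 8, 8, 9, 12, 15, 20, 21]
index = build_sparse_index(keys)
hits, scanned = lookup(keys, index, 8)
```

The index holds 4 entries for 14 rows, and the lookup scans only the granules that could contain the key. The trade-off is that a point lookup always reads at least one whole granule, which suits analytical scans far better than OLTP-style single-row access.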

2. Doris

Doris is a big data analysis engine led by Baidu, rewritten on the basis of the Google Mesa paper and the Impala project. It is a massive distributed KV storage system designed to support medium-sized, highly available, scalable KV storage clusters.

Doris provides mass storage, linear scaling, smooth expansion, automatic fault tolerance, failover, and high concurrency, with low operations cost. For deployment scale, 4 to 100+ servers are recommended.

The main components of Doris3: the DT (Data Transfer) module handles data import, the DS (Data Searcher) module handles data queries, and the DM (Data Master) module manages cluster metadata. Data is stored in the Armor distributed key-value engine. Doris3 relies on ZooKeeper to store metadata, so the other modules are stateless, and the system as a whole has no single point of failure.

3. Druid

Druid is an open source, distributed, column-oriented data store for real-time analytics.

Druid's key features:

- Sub-second OLAP query analysis: uses columnar storage, inverted indexes, bitmap indexes, and other key techniques to filter, aggregate, and run multidimensional analysis on massive data at sub-second latency.
- Real-time streaming data analysis: Druid provides real-time streaming analysis and efficient real-time ingestion, and real-time data becomes visible within sub-seconds.
- Rich data analysis features: Druid provides a friendly visual interface and a SQL query language.
- High availability and scalability: each Druid worker node has a single function and does not depend on the others, and the cluster is easy to manage, fault tolerant, disaster-ready, and easy to scale.
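The bitmap index mentioned among the features can be illustrated in a few lines. This is a sketch of the general technique, not Druid's implementation (which stores compressed bitmaps inside segments): Python integers serve as bitmaps, and the rows and column names are made up for this example.

```python
from collections import defaultdict

def build_bitmap_index(rows, column):
    """Map each distinct value to an integer whose set bits are the matching row ids."""
    index = defaultdict(int)
    for row_id, row in enumerate(rows):
        index[row[column]] |= 1 << row_id
    return index

def row_ids(bitmap):
    """Decode a bitmap back into a list of row ids."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

rows = [
    {"city": "bj", "os": "ios"},
    {"city": "sh", "os": "android"},
    {"city": "bj", "os": "android"},
    {"city": "bj", "os": "ios"},
]
by_city = build_bitmap_index(rows, "city")
by_os = build_bitmap_index(rows, "os")

# WHERE city = 'bj' AND os = 'ios' becomes a single bitwise AND.
matches = row_ids(by_city["bj"] & by_os["ios"])
```

Filters on several dimensions combine with bitwise AND and OR before any row is read, which is one reason multidimensional filtering can stay fast on large data sets.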

4. TiDB

TiDB is an open source distributed relational database independently designed and developed by PingCAP. It is a converged distributed database product that supports both OLTP and OLAP.

TiDB is compatible with the MySQL 5.7 protocol and the MySQL ecosystem, among other important features. Its goal is to give users a one-stop OLTP, OLAP, and HTAP solution. TiDB suits scenarios that demand high availability, strong consistency, and large data scale.

5. Greenplum

Greenplum is a powerful distributed relational database that adopts an MPP architecture on top of open source PostgreSQL. To integrate with the Hadoop ecosystem, HAWQ was introduced: its analysis engine retains Greenplum's high-performance engine, while the storage layer uses HDFS instead of local disks, which avoids the poor reliability of local disks and fits into the Hadoop ecosystem.

4.3 Summary

5. Lambda, Kappa, and Omega architectures

5.1 Lambda architecture

The Lambda architecture consists mainly of a data source (Kafka), data processing (Storm, Hadoop), and a serving database (Serving DB). The data source and the serving database are the entry and exit points for data in the architecture. Data processing is split into online (real-time) processing and offline processing.

When data enters the Lambda architecture through the Kafka message middleware, it flows into both the offline processing module (Hadoop) and the real-time processing module (Storm). Offline processing performs batch computation and produces a large number of T+1 summaries. Real-time processing uses stream or micro-batch processing to compute second-level and minute-level results. Both sets of results are finally written to the serving database (Serving DB), where they are merged and exposed to upper-layer services.
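The flow above can be sketched as two views over the same event stream plus a serving function that merges them. The event tuples, the `cutoff_day` parameter, and the function names are assumptions for illustration; in the architecture described here, Hadoop and Storm would play the roles of `batch_view` and `speed_view`.

```python
def batch_view(events, cutoff_day):
    """Offline (T+1) layer: recompute totals from all events up to the cutoff day."""
    totals = {}
    for day, key, n in events:
        if day <= cutoff_day:
            totals[key] = totals.get(key, 0) + n
    return totals

def speed_view(events, cutoff_day):
    """Real-time layer: count only events newer than the last batch run."""
    totals = {}
    for day, key, n in events:
        if day > cutoff_day:
            totals[key] = totals.get(key, 0) + n
    return totals

def serve(key, batch, speed):
    """Serving DB: answer a query by combining both views."""
    return batch.get(key, 0) + speed.get(key, 0)

# Hypothetical (day, item, count) events; the batch job has processed days 1 and 2.
events = [(1, "item_a", 3), (1, "item_b", 1), (2, "item_a", 2), (3, "item_a", 4)]
batch = batch_view(events, cutoff_day=2)
speed = speed_view(events, cutoff_day=2)
answer = serve("item_a", batch, speed)
```

Note that the counting logic appears twice, once per layer. Even in this toy version the two copies must stay in agreement, which is the maintenance burden the Lambda architecture is known for.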

The benefits of the Lambda architecture: the architecture is simple, it combines the strengths of offline batch processing and real-time stream processing, and it delivers stable real-time computation at a controllable cost.

It is also friendly to data correction. If the statistical definition of a metric changes later, rerunning the offline jobs quickly revises the historical data to the latest definition.

However, Lambda also has many problems. The most prominent is that both the real-time and the offline processing code must be maintained, and the two must produce consistent results, which is undoubtedly troublesome.

5.2 Kappa architecture

Kafka and other message middleware can retain multiple days of data. Under normal conditions Kafka emits real-time data, which passes through the real-time processing system into the serving database (Serving DB).

When the system needs a data correction, fix the real-time processing code, replay the messages, and scale up the concurrency of the real-time processing system to quickly reprocess the historical data.

This architecture is very simple: it avoids maintaining two systems and keeping their results consistent, and it also solves the data correction problem.
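Correction by replay can be sketched as running the same log through a fixed version of the processing code. The log contents and the two transform functions are hypothetical; in practice the log would be a Kafka topic with enough retention, and the replay would start a new consumer from the earliest offset.

```python
def process(log, transform):
    """Run one streaming job over the retained log and build the serving view."""
    view = {}
    for key, value in log:
        view[key] = view.get(key, 0) + transform(value)
    return view

# Hypothetical retained log of (shop, amount in cents) messages.
log = [("shop_1", 250), ("shop_2", 100), ("shop_1", 50)]

def transform_v1(cents):
    """Buggy first version: forgets to convert cents to whole currency units."""
    return cents

def transform_v2(cents):
    """Fixed version of the same logic."""
    return cents / 100

view_v1 = process(log, transform_v1)
view_v2 = process(log, transform_v2)   # replay from the start with the fixed code
```

There is only one code path to maintain, and the corrected view simply replaces the old one once the replay has caught up. The cost is that the whole history must still be available in the log and must be recomputed.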

But it also has its own problems:

1. The amount of data the message middleware can cache, and the replay of that data, become performance bottlenecks. Algorithms often need the past 180 days of data; keeping all of it in the message middleware puts it under heavy pressure. Likewise, reprocessing 180 days of data in a single correction consumes a great deal of real-time computing resources.

2. In real-time processing, joining a large number of different real-time streams depends heavily on the capabilities of the real-time computing system, and data may be lost because of the order in which the streams arrive.

For example, a consumer searches for a product on Taobao. Normally, the exposure (impression) data for a product in the search results should be emitted before the user's click data. Because of system delays, however, the exposure data for an item may reach the real-time processing system after the click data. A developer who is unaware of this is likely to write code in which exposure data waits for click data to join. Click data that finds no matching exposure data is then easily discarded by some simple conditional statement.
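One way to avoid dropping early clicks is to buffer them until the matching exposure arrives. The sketch below assumes a single in-memory stream of `(kind, item_id, timestamp)` tuples, all of which are made up for illustration. A production job would also bound the buffer with a timeout or watermark, which is omitted here.

```python
def join_stream(events):
    """Pair each click with its exposure, tolerating clicks that arrive first."""
    exposures = {}        # item_id -> exposure timestamp
    pending_clicks = {}   # item_id -> click timestamps still waiting for an exposure
    joined = []
    for kind, item_id, ts in events:
        if kind == "exposure":
            exposures[item_id] = ts
            # Flush any clicks that arrived before this exposure.
            for click_ts in pending_clicks.pop(item_id, []):
                joined.append((item_id, ts, click_ts))
        elif item_id in exposures:
            joined.append((item_id, exposures[item_id], ts))
        else:
            # The naive version would discard the click here.
            pending_clicks.setdefault(item_id, []).append(ts)
    return joined, pending_clicks

# Item "b" shows the problem: its click (ts=4) arrives before its exposure (ts=3).
events = [
    ("exposure", "a", 1),
    ("click", "a", 2),
    ("click", "b", 4),
    ("exposure", "b", 3),
]
joined, still_pending = join_stream(events)
```

Both items are joined and nothing is left in the buffer. Replacing the final `else` branch with a plain skip reproduces the data loss described above.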

In offline processing, messages are processed in batches, so this kind of failed join does not occur. Under the Lambda architecture, even if real-time processing loses some data, the impact on the overall result is small because offline data dominates. Even if a day's real-time results are wrong, the correct offline results overwrite them the next day, so the final result is correct.

5.3 Summary

Advantages and disadvantages of the Lambda and Kappa architectures:

Architecture	Advantages	Disadvantages
Lambda	1. Simple architecture 2. Combines the strengths of offline batch processing and real-time stream processing 3. Stable real-time computation at a controllable cost 4. Offline data is easy to correct	1. Real-time and offline results are hard to keep consistent 2. Two systems must be maintained
Kappa	1. Only the real-time processing module needs to be maintained 2. Data can be reprocessed by replaying messages 3. No need to merge offline and real-time data	1. Strong dependence on the caching capacity of the message middleware 2. Data may be lost during real-time processing

6. Ref

Original site

Copyright notice
This article was written by [Goose]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2021/08/20210805190122059U.html