
Technical implementation | Apache Doris hot and cold data storage (Part 1)

2022-06-24 19:20:00 ApacheDoris

1.    Preface


For any database software, whether it is built on a traditional database model or a distributed architecture, the core is always the data itself. The life cycle of data is reflected in CRUD operations (create, read, update, delete). From the moment a piece of data is generated, its value declines over time until it becomes useless and is finally deleted.

Users, as the consumers of data, do not need all data to the same degree. They tend to demand efficient and stable access to important data and have no such requirement for unimportant data, and meeting the former usually costs far more than meeting the latter. Once their requirements for using data are met, users naturally start to consider the cost of storing it. For data that is rarely or almost never accessed, lower-cost storage is the better choice.

For such scenarios, we divide data into "hot data" and "cold data". As the names suggest, hot data is data that users need to access frequently, while cold data is rarely accessed. Data is generally hot when it is newly created and gradually becomes cold as time goes on.

2.    Combining hot and cold data with storage-compute separation


2.1.   Storage-compute separation, an old topic

Any discussion of hot and cold data storage has to mention "storage-compute separation", because the two overlap and are closely related. Storage-compute separation means separating storage nodes from compute nodes so that limited computing power can drive more storage resources, which lowers cost.

The concept of storage-compute separation has become very popular in recent years, but it already existed more than ten years ago. The earliest versions of Impala stored their data on HDFS and loaded it remotely into Impala's cache for computation, which is the most typical form of storage-compute separation. Why, then, has it been proposed again and again over the years? The most important reason is that pure storage-compute separation has serious defects: it sacrifices query efficiency, and modifying data is very troublesome. This is unacceptable for a data warehouse that pursues query efficiency, and it was a major reason why early Impala did not become widely adopted.

Storage-compute separation is far more than reading and writing files through the HDFS interface. The data management strategy deserves much more attention.


2.2.   Hot and cold data storage based on the storage-compute separation model

As early as its first version, "Baidu PALO 1.0", Doris drew on Impala's lessons in this area. Instead of remote storage such as HDFS, it stores data on local disks and achieves fast access by controlling and scheduling that local storage. This lets Doris's basic query speed reach a satisfactory level and is one of Doris's own advantages. To keep this advantage while also gaining the low cost of storage-compute separation, it is necessary to introduce the concept of hot and cold data.

Hot data is accessed very frequently and is usually the data users care about most. Its timeliness requirements are generally high, and it is read and written more often. This is exactly the problem that Doris local storage focuses on solving.

Cold data is often much larger in volume than hot data and is rarely accessed, so keeping it on local storage is expensive. In this case the storage-compute separation model applies: storing cold data on a lower-cost storage medium greatly reduces cost.

3.   Doris hot and cold data storage schemes


3.1.  Local hot and cold data storage (the old scheme)


Doris has tried to implement hot and cold data storage before. The earliest scheme used a local hot and cold data model, whose core idea is to attach multiple disks to the BE nodes in the cluster: some are SSDs that store hot data, and the others are HDDs that store cold data. At CREATE TABLE time, users specify, according to their needs, that the table is a hot data table and the time at which it is converted to cold data.

As shown in the figure below, when a hot data table is created, FE randomly selects several of the BE nodes that have SSDs mounted (BE1, BE2, BE3) and creates the data shards on them. When the specified cooldown time is reached, the data on the SSD is copied to the HDD and its metadata is updated.
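The placement and cooldown steps of this local scheme can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Doris's actual implementation: the names `Backend`, `Shard`, `create_hot_shards`, and `cool_down` are hypothetical, every SSD node is assumed to also have an HDD, and the data copy itself is reduced to a metadata change.

```python
import random
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Backend:
    name: str
    media: set  # storage media mounted on this node, e.g. {"SSD", "HDD"}


@dataclass
class Shard:
    shard_id: int
    be_name: str
    medium: str              # "SSD" while hot, "HDD" once cooled
    cooldown_time: datetime  # when the shard should turn cold


def create_hot_shards(backends, shard_id, replicas, cooldown_time, rng=random):
    """Pick `replicas` distinct SSD-equipped backends at random for a new hot shard."""
    candidates = [be for be in backends if "SSD" in be.media]
    if len(candidates) < replicas:
        raise ValueError("not enough SSD-equipped backends")
    chosen = rng.sample(candidates, replicas)
    return [Shard(shard_id, be.name, "SSD", cooldown_time) for be in chosen]


def cool_down(shards, now):
    """Mark shards whose cooldown time has passed as stored on HDD.

    Assumption: a real system would copy the data to the HDD first and
    only then update the metadata; this sketch models the metadata only.
    """
    moved = []
    for shard in shards:
        if shard.medium == "SSD" and now >= shard.cooldown_time:
            shard.medium = "HDD"
            moved.append(shard)
    return moved
```

Passing a seeded `random.Random` instance as `rng` makes the placement reproducible, which is convenient when experimenting with the sketch.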

The benefit of local storage is that the data still lives on local disks, so query efficiency is barely affected. In addition, data on HDD is handled with the same logic as data on SSD, so no separate processing logic needs to be written for changes to the data structure.

The disadvantage is that, although HDDs are much cheaper than SSDs, the difference is not a qualitative one. Scaling out is also inconvenient and load balancing is troublesome, so this scheme cannot fundamentally solve the cost problem.


3.2.  Hot and cold data model based on storage-compute separation

The new hot and cold data scheme builds on the storage-compute separation model. Its core idea is that Doris local storage carries the hot data and an external cluster (HDFS, S3, etc.) carries the cold data. When data is imported, it first exists as hot data on the local disks of the BE nodes. When the data needs to turn cold, a cold replica shard is created for the hot data shard and the data is dumped to the external cluster designated for cold data. Once the cold replica has been generated, the hot data shard is deleted.
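The ordering of that cooldown flow (write the cold copy first, delete the hot copy last) can be sketched as follows. This is only an illustration under assumptions: `BackendNode`, `cool_shard`, and the `cold/<id>` key layout are hypothetical, and a plain dictionary stands in for the external cluster (HDFS, S3, etc.).

```python
from dataclasses import dataclass, field


@dataclass
class BackendNode:
    local_disk: dict = field(default_factory=dict)  # shard_id -> bytes (hot data)
    cold_meta: dict = field(default_factory=dict)   # shard_id -> remote key


def cool_shard(be, remote, shard_id):
    """Move one shard from local (hot) storage to remote (cold) storage.

    The hot copy is deleted only after the cold copy has been written
    and verified, so the data is never left without a complete copy.
    """
    data = be.local_disk[shard_id]
    remote_key = f"cold/{shard_id}"       # hypothetical key layout
    remote[remote_key] = data             # 1. dump the data to the external store
    if remote.get(remote_key) != data:    # 2. confirm the cold copy is complete
        raise IOError(f"cold copy of shard {shard_id} is incomplete")
    be.cold_meta[shard_id] = remote_key   # 3. keep metadata for the cold data locally
    del be.local_disk[shard_id]           # 4. only now delete the hot copy
    return remote_key
```

If the upload or verification fails, the function raises before step 4, so the hot shard stays in place and the cooldown can be retried.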

As shown in the figure below, after data turns cold, the BE keeps the metadata of the cold data locally. When a query hits cold data, the BE uses this metadata to cache the cold data locally for use.
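The read path for a query that hits cold data might look roughly like this. It is a sketch under assumptions rather than Doris's real code: `ColdAwareBackend` and its attributes are hypothetical, a dictionary stands in for the external cluster, and no cache eviction is modeled.

```python
class ColdAwareBackend:
    """Toy BE node that serves hot data locally and caches cold data on demand."""

    def __init__(self, remote):
        self.remote = remote    # stand-in for the external cluster
        self.local_disk = {}    # shard_id -> bytes (hot data)
        self.cold_meta = {}     # shard_id -> remote key, kept locally for cold data
        self.cache = {}         # shard_id -> bytes fetched from the remote store
        self.remote_reads = 0   # counts fetches from the remote store

    def read(self, shard_id):
        if shard_id in self.local_disk:        # hot data: read from local disk
            return self.local_disk[shard_id]
        if shard_id in self.cache:             # cold data that is already cached
            return self.cache[shard_id]
        remote_key = self.cold_meta[shard_id]  # raises KeyError for unknown shards
        data = self.remote[remote_key]         # fetch from the external store
        self.remote_reads += 1
        self.cache[shard_id] = data            # cache locally for later queries
        return data
```

Only the first read of a cold shard reaches the remote store; repeated reads are served from the local cache.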

Because cold data is used very infrequently, this approach lets a limited number of BE nodes manage more data, at a much lower cost than pure local storage.


4.    Summary

This article has introduced the overall scheme for implementing hot and cold data in Doris. Space is limited, so many aspects of hot and cold data management have not been covered, such as how to handle new data written into a cold data partition, how to handle cold data when the table schema is modified, and the cleanup strategy for the local data cache. These topics will be covered in later articles, so stay tuned.


-  About the author  -

Peng Xiangyu
Senior R&D engineer on the Baidu PALO team, with extensive experience in big data engineering. A main implementer of the Apache Doris hot and cold storage module, specializing in the development of Doris ecosystem components.


   Apache Doris open source community links


Apache Doris official website:

http://doris.apache.org

Apache Doris GitHub:

https://github.com/apache/incubator-doris

Apache Doris developer mailing list:

[email protected] 


This article comes from the WeChat official account ApacheDoris (gh_80d448709a68).