ClickHouse High-Performance Column Storage: Core Principles
2022-06-24 16:35:00 【Goose】
ClickHouse is an open-source columnar database that has drawn a lot of attention in recent years, mainly in the data analysis (OLAP) field. Large Chinese companies have already adopted it at scale:
- Toutiao (ByteDance) uses ClickHouse internally for user behavior analysis, with several thousand ClickHouse nodes in total, the largest single cluster reaching 1,200 nodes, tens of PB of total data, and roughly 300 TB of new raw data per day.
- Tencent uses ClickHouse internally for game data analysis and has built a dedicated monitoring and maintenance system around it.
- Ctrip began trialing it in July 2018; today about 80% of its business runs on ClickHouse, with more than a billion rows added and nearly a million query requests served every day.
- Kuaishou also uses ClickHouse internally, storing about 10 PB in total and adding around 200 TB per day, with 90% of queries completing in under 3 seconds.
- Alibaba has built its own cloud-hosted ClickHouse service, which is widely used across many businesses, including Mobile Taobao traffic analysis.
Overseas, Yandex runs hundreds of nodes internally for user click behavior analysis, and leading companies such as CloudFlare and Spotify use it as well.
In just a few years as an open-source project, ClickHouse has won the "hearts" of many large companies, and on GitHub it is more active than many classic open-source projects such as Presto, Druid, Impala, and Greenplum; its popularity and the vitality of its community are plain to see.
An important reason behind all of this is its extreme performance, which greatly accelerates business iteration. This article attempts to interpret the design and implementation of the ClickHouse storage layer and analyze where its performance comes from.
ClickHouse Component Architecture
The figure below shows a typical ClickHouse cluster deployment, a classic shared-nothing architecture.
The cluster is divided into several shards, and the data of different shards is isolated from each other. Within a shard, one or more replicas can be configured, and replicas of the same shard keep each other eventually consistent through a proprietary replication protocol.
ClickHouse divides tables into local tables and distributed tables according to the table engine, and both need to be created on every node. A local table only serves write and query requests on its own server; a distributed table splits write and query requests according to specific rules, distributes them to all servers, and finally aggregates the results.
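As a concrete sketch (the cluster name, database, table names, and columns below are hypothetical; the actual DDL depends on your schema and cluster configuration), a local MergeTree table plus a Distributed table routing to it might look like this:

-- Hypothetical example: create the local table on every node
-- (ON CLUSTER does this in a single statement).
CREATE TABLE default.hits_local ON CLUSTER my_cluster
(
    id    UInt8,
    name  String,
    _date Date
)
ENGINE = MergeTree
ORDER BY (id, _date);

-- The distributed table stores no data itself; it routes requests to
-- default.hits_local on every shard, sharding rows by rand().
CREATE TABLE default.hits_all ON CLUSTER my_cluster
AS default.hits_local
ENGINE = Distributed(my_cluster, default, hits_local, rand());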
ClickHouse Write Path
ClickHouse provides two ways of writing data: 1) writing to local tables; 2) writing to distributed tables.
When writing to local tables, the application layer needs to know the IPs of all the underlying servers and handle data sharding itself. Because every node can be written to directly, the cluster's overall write capacity scales with the number of nodes, giving very high throughput and great flexibility for customization. On the other hand, it places more responsibility on the application layer and introduces extra complexity: node failover, data re-balancing when scaling out, and using different table engines for writes and queries all have to be handled by the business side.
Writing to a distributed table is much simpler: the application only writes to a single endpoint and a single distributed table, without needing to know the underlying server topology or other implementation details. Writing to distributed tables also performs well; in business scenarios that do not demand extremely high write throughput, writing directly to the distributed table is recommended to reduce business complexity.
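Continuing the hypothetical tables sketched above, the two write styles differ only in which table the application targets:

-- Write through the distributed table: any node accepts the INSERT and
-- forwards the rows to the appropriate shards.
INSERT INTO default.hits_all (id, name, _date) VALUES (1, 'foo', '2022-06-24');

-- Write to the local table: the application must choose the shard itself
-- and connect to a node belonging to that shard.
INSERT INTO default.hits_local (id, name, _date) VALUES (1, 'foo', '2022-06-24');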
The following describes how writing to a distributed table works.
ClickHouse uses Block as the core abstraction for data processing. A Block represents multi-column data in memory, and the column data is kept in columnar layout in memory as well. The schematic is as follows: the header section contains the block's metadata, while id UInt8, name String, and _date Date are the in-memory representations of three columns of different types.
On top of Block, ClickHouse encapsulates stream interfaces for streaming IO, namely IBlockInputStream and IBlockOutputStream, with different implementations responsible for different functions.
When an INSERT INTO request is received, ClickHouse constructs a complete stream pipeline, with each stream implementing its own piece of logic:
InputStreamFromASTInsertQuery        # wraps the INSERT INTO request as an InputStream data source
-> CountingBlockOutputStream         # counts the number of blocks written
-> SquashingBlockOutputStream        # accumulates written blocks until a memory threshold is reached, improving write throughput
-> AddingDefaultBlockOutputStream    # fills missing columns with their default values
-> CheckConstraintsBlockOutputStream # checks whether all constraints are satisfied
-> PushingToViewsBlockOutputStream   # if there are materialized views, writes the data to them as well
-> DistributedBlockOutputStream      # writes the block to the distributed table
Throughout this process, ClickHouse pays great attention to detail, optimizing for performance everywhere. When parsing SQL, ClickHouse does not parse the complete INSERT INTO table(cols) VALUES(rows) statement in one pass; instead, it first reads the short header INSERT INTO table(cols) to build the block structure, then parses the bulky VALUES portion in a streaming fashion, reducing memory overhead. When a block is passed between streams, a copy-on-write mechanism is used to minimize memory copies. A columnar structure is used in memory, so the data is ready to be flushed to disk directly in columnar format later.
SquashingBlockOutputStream merges many small writes from the client into large batches, which improves write throughput, reduces write amplification, and speeds up background data compaction.
By default, writes to a distributed table are forwarded asynchronously.
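For completeness, this behavior is tunable: ClickHouse has an insert_distributed_sync setting that switches distributed writes to synchronous forwarding (the table name below is the hypothetical one used earlier, and exact behavior can vary by version):

-- With insert_distributed_sync = 1 the INSERT only returns after the data
-- has been written to the target shards, trading latency for stronger
-- delivery guarantees.
SET insert_distributed_sync = 1;
INSERT INTO default.hits_all (id, name, _date) VALUES (2, 'bar', '2022-06-24');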
DistributedBlockOutputStream splits the Block into shards according to the rules specified in the table DDL (such as a hash expression or rand()). Each shard corresponds to a local subdirectory; the corresponding data is written as a .bin file into that subdirectory, and success is returned to the client once the write completes. A background thread of the distributed table then scans these folders and pushes the .bin files to the servers of the corresponding shards. The .bin file storage format is shown below:
ClickHouse Storage Format
ClickHouse uses a columnar format for single-node storage and organizes and merges data in an LSM-tree-like way. The on-disk file layout of a MergeTree local table is shown in the figure below.
The data of a local table is split into multiple data parts, each corresponding to one directory on disk. Once a data part has been flushed to disk, it is immutable and never changes again. ClickHouse schedules a background MergerThread that keeps merging small data parts into larger ones, yielding a higher compression ratio and faster queries. Every INSERT into a local table produces a new data part, i.e. a new directory. If the INSERT batch size is too small and the INSERT frequency is very high, too many directories may be created, exhausting inodes and degrading background merge performance; this is why ClickHouse recommends writing in large batches at no more than one INSERT per second.
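One way to observe this is to inspect the system.parts table after a few inserts and, if needed, force an off-schedule merge; a sketch, again using the hypothetical local table from above:

-- Each INSERT creates a new active part; background merges gradually reduce the count.
SELECT name, rows, active
FROM system.parts
WHERE database = 'default' AND table = 'hits_local';

-- Force a merge of all parts into one (use sparingly in production).
OPTIMIZE TABLE default.hits_local FINAL;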
Inside a data part, each column's data is stored separately; thanks to the columnar format, different columns use completely independent physical files. Each column consists of at least two files, a .bin file and a .mrk file. The .bin file is the data file holding the actual data, while the .mrk file is a metadata file holding the marks that describe the data. In addition, ClickHouse supports primary indexes, skip indexes, and so on, so a part may also contain the corresponding pk.idx and skip_idx.idx files.
During writing, the data is cut into multiple granules according to index_granularity; by default every 8192 rows form one granule. Once multiple granules have accumulated to a certain size in the in-memory buffer (controlled by the parameter min_compress_block_size, default 64 KB), compression and flushing to disk are triggered, forming a block. Each granule corresponds to one mark, which mainly stores two pieces of information: 1) the offset of the current block within the compressed physical file; 2) the offset of the current granule within the decompressed block. The block is therefore ClickHouse's smallest unit of disk IO and compression/decompression, while the granule is ClickHouse's smallest unit of in-memory data scanning.
If the table has an ORDER BY key or primary key, ClickHouse sorts the data by the ORDER BY key before the block is flushed to disk. The primary key index pk.idx stores the first row of data corresponding to each mark, i.e. the minimum value of each indexed column within each granule.
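A rough worked example of the resulting sizes (the numbers are purely illustrative): with the default index_granularity of 8192, a data part holding 1,000,000 sorted rows is split into ceil(1,000,000 / 8192) = 123 granules, so each column's .mrk file and the pk.idx file contain on the order of 123 entries, while the number of compressed blocks in each .bin file depends on how many granules fit under min_compress_block_size.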
When other types of sparse (skip) indexes are present, there will be an additional <col>_<type>.idx file that records statistics about the corresponding granules, for example (a DDL sketch illustrating these index types follows the list):
- minmax records the minimum and maximum values within each granule;
- set records the distinct values within each granule;
- bloomfilter uses an approximate data structure to record whether a given value exists in the corresponding granule.
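A minimal sketch of declaring such skip indexes (the table, column, and index names are hypothetical, and the GRANULARITY values are arbitrary):

CREATE TABLE default.events_local
(
    id    UInt8,
    name  String,
    _date Date,
    INDEX idx_date_minmax _date TYPE minmax GRANULARITY 4,            -- min/max per index block
    INDEX idx_id_set      id    TYPE set(100) GRANULARITY 4,          -- distinct values, capped at 100
    INDEX idx_name_bf     name  TYPE bloom_filter(0.01) GRANULARITY 4 -- approximate membership test
)
ENGINE = MergeTree
ORDER BY (id, _date);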
At query time, if the query contains a condition on the primary key, ClickHouse first performs a binary search in pk.idx to find the matching granule marks, reads the block offset, granule offset, and other information for those marks from the mark file, and then reads the data from disk into memory to search it. Similarly, if a condition hits a skip index, the minmax, set, or other information in the index is used to locate the marks of the qualifying granules before the IO is performed. With the help of the mark files, once the qualifying granules have been located, ClickHouse can spread them evenly across multiple threads for parallel processing, making the most of disk IO throughput and multi-core CPU capacity.
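On reasonably recent ClickHouse versions, this pruning can be inspected with EXPLAIN (availability and output format depend on the version; the query assumes the hypothetical skip-index table above):

-- 'indexes = 1' shows which parts and granules were selected by the
-- primary key and by the skip indexes for this query.
EXPLAIN indexes = 1
SELECT count()
FROM default.events_local
WHERE id = 1 AND _date = '2022-06-24';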
Summary
This article has walked through the design of the ClickHouse storage layer from the perspectives of overall architecture, write path, and storage format. ClickHouse cleverly combines columnar storage, sparse indexes, and multi-core parallel scanning to squeeze the most out of the hardware, which gives it a very clear advantage in OLAP scenarios.