当前位置：网站首页>Clickhouse uses distributed join of pose series

Clickhouse uses distributed join of pose series

2022-06-24 12:21:00 【fastio】

JOIN Operation is OLAP The scene cannot be bypassed , And widely used operation . Yes ClickHouse for , It is very necessary to deal with distributed systems JOIN Make an in-depth study of the implementation .

In introducing distributed JOIN Before , Let's see. ClickHouse stand-alone JOIN How is it realized .

1. ClickHouse stand-alone JOIN Realization

ClickHouse stand-alone JOIN Operation default HASH JOIN Algorithm , Optional MERGE JOIN Algorithm . among ,MERGE JOIN Algorithm data will overflow to disk , Poor performance compared to the former . In this paper, we focus on HASH JOIN Implementation of algorithm JOIN operation .

ClickHouse JOIN The query syntax is as follows ：

SELECT <expr_list>
FROM <left_table>
[GLOBAL] [INNER|LEFT|RIGHT|FULL|CROSS] [OUTER|SEMI|ANTI|ANY|ASOF] JOIN <right_table>
(ON <expr_list>)|(USING <column_list>) ...

ClickHouse Of HASH JOIN The algorithm is simple to implement ：

from right_table Read the full data of the table , Build in memory HASH MAP;
from left_table Reading data in batches , according to JOIN KEY To HASH MAP To find , If hit , Then the data is used as JOIN Output ;

1.jpg

As can be seen from this implementation , If right_table The amount of data exceeds the limit of available memory space of a single machine , be JOIN The operation cannot be completed . Usually , The two tables JOIN when , Use smaller tables as right_table.

2. ClickHouse Distributed JOIN Realization

ClickHouse It's a decentralized architecture , It's very easy to scale the cluster horizontally . When providing services in cluster mode , Distributed JOIN Queries cannot be avoided . Distributed here JOIN Usually refers to ,JOIN The... Involved in the query left_table And right_table It's a distributed table .

Usually , Distributed JOIN The implementation mechanism is nothing more than the following ：

Broadcast JOIN
Shuffle Join
Colocate JOIN

ClickHouse The cluster does not achieve a complete sense of Shuffle JOIN, Implemented classes Broadcast JOIN, By completing data redistribution in advance , Can achieve Colocate JOIN.

ClickHouse Distributed JOIN Queries can be divided into two categories , belt GLOBAL Keywords , And without GLOBAL Keyword situation .

2.1 Ordinary JOIN Realization

2.1 Described in GLOBAL JOIN The implementation of the . Next, let's look at none GLOBAL Keywords JOIN How to achieve it ：

a. initiator take SQL S Replace the local table with the corresponding table in the distributed table , formation S'
b. initiator take a. Medium S' Distributed to each node of the cluster
c. The cluster node executes S', And summarize the results to initiator node
d. initiator The node returns the result to the client

If the right table is a distributed table , Then each node in the cluster will execute distributed queries . There will be a very serious read amplification phenomenon . Suppose the cluster has N Nodes , The right table query will be executed in the cluster N*N Time .

a.png

As shown in the figure , Executive SQL by ：

SELECT a_.i, a_.s, b_.t FROM a_all as a_ JOIN b_all AS b_ ON a_.i = b_.i

among ,a_all, b_all For distributed tables , The corresponding local table name is a_local, b_local. Then change SQL The timing of distributed execution is ：

1）initiator Received a query request SELECT a_.i, a_.s, b_.t FROM a_local AS a_ JOIN b_all as b_ ON a_.i = b_.i That is, the left distributed table is changed to the local table name . The SQL Execute in parallel within the cluster .
2) initiator Execute distributed queries , This node and other nodes execute
3） The cluster node received 2） in SQL after , When analyzing the right table, the distributed table , Then a distributed query is triggered ：SELECT b_.i, b_.t FROM b_local AS b_ Each node of the cluster executes the task concurrently , And merge the results , Write it down as subquery.
4） The cluster node is finished 3） in SQL After execution , perform SELECT a_.i, a_.s, b_.t FROM a_local AS a_ JOIN subquery as b_ ON a_.i = b_.i among subquery Express 2 Results of execution in
5) The execution of each node is completed JOIN After calculation , towards initiator Node sends data

It can be seen that ,ClickHouse General distributed JOIN Query is a simple version of Shuffle JOIN The implementation of the , Or an incomplete implementation . What is incomplete is , Not according to JOIN KEY Go to Shuffle data , Instead, each node pulls all the data in the right table . In fact, there is room for optimization .

In the production environment , The impact of query amplification on query performance can not be ignored .

2.2 GLOBAL JOIN Realization

GLOBAL JOIN The calculation process is as follows ：

a. If the right table is a subquery , be initiator Complete subquery calculation ;
b. initiator Send the data in the right table to other nodes in the cluster ;
c. The cluster node compares the data of the left table with that of the local table and the right table JOIN Calculation ;
d. Other nodes in the cluster send the results back to initiator node ;
e. initiator Summarize the results , Send to client ;

GLOBAL JOIN Can be seen as an incomplete Broadcast JOIN Realization . If JOIN The right table has a large amount of data , It will occupy a lot of network bandwidth , Results in query performance degradation .

b.jpg

As shown in the figure , Executive SQL by ：

SELECT a_.i, a_.s, b_.t FROM a_all as a_ GLOBAL JOIN b_all AS b_ ON a_.i = b_.i

among ,a_all, b_all For distributed tables , The corresponding local table name is a_local, b_local. Then change SQL The timing of distributed execution is ：

1) initiator Received a query request SELECT b_.i, b_.t FROM b_local AS b_ That is, the left distributed table is changed to the local table name . The SQL Execute in parallel within the cluster . Summarize the results , Record as subquery.
2) initiator And other nodes in the cluster
3）initiator take 2) in subquery Send to other nodes in the cluster , And trigger distributed query ：SELECT a_.i, a_.s, b_.t FROM a_local AS a_ JOIN subquery as b_ ON a_.i = b_.i among subquery Express 2) Results of execution in
4) The execution of each node is completed JOIN After calculation , towards initiator Node sends data

It can be seen that ,GLOBAL JOIN Put the query of the right table in initiator After completion on the node , Send to other nodes through the network , Avoid double calculation of other nodes , So as to avoid query amplification .

3. Distributed JOIN Best practices

It's clear ClickHouse Distributed JOIN After the query is implemented , Let's sum up some practical experience .

One 、 Try to reduce JOIN The amount of data in the right table

ClickHouse according to JOIN Data in the right table , structure HASH MAP, And will SQL All the required columns in the are read into memory . If the amount of data in the right table is too large , The node memory cannot hold the post , Unable to complete calculation .

In practice, , We usually use the smaller table as the right table , And increase the filtering conditions as much as possible , Reduce entry JOIN The amount of data calculated .

Two 、 utilize GLOBAL JOIN Avoid performance loss caused by query amplification

If the amount of data in the right table or sub query is controllable , have access to GLOBAL JOIN To avoid reading and amplifying . It should be noted that ,GLOBAL JOIN Will trigger data propagation between nodes , Occupy part of the network traffic . If the amount of data is large , It also brings performance loss .

3、 ... and 、 Data pre distribution implementation Colocate JOIN

When JOIN When the amount of table data involved is very large , Reading amplification , Or network broadcasting brings huge performance loss , We need to take another way to complete JOIN To calculate the .

according to “ identical JOIN KEY Must be the same piece ” principle , We are going to deal with JOIN Calculated table , Press JOIN KEY Partition in the cluster dimension . Will be distributed JOIN Turn to the local of the node JOIN, Greatly reduce the query amplification problem .

If you do the following ：

Will involve JOIN Press... For your watch JOIN KEY Fragmentation
according to 2.2 Section description , take JOIN It is expected that the middle right table will be replaced with the corresponding local table

c.jpg

As shown in the figure , Executive SQL by ：

SELECT a_.i, a_.s, b_.t FROM a_all as a_ JOIN b_local AS b_ ON a_.i = b_.i

among ,a_all, b_all For distributed tables , The corresponding local table name is a_local, b_local. Then change SQL The timing of distributed execution is ：

1) initiator Received a query request SELECT a_.i, a_.s, b_.t FROM a_local AS a_ JOIN b_local as b_ ON a_.i = b_.i
2) initiator Launch a distributed query , The native and other nodes execute ：
3) The execution of each node is completed JOIN After calculation , towards initiator Node sends data

Due to the data and pre segmentation , same JOIN KEY The corresponding data must be together , Does not exist across nodes , Therefore, there is no need to do distributed query on the right table , You can also get the right results .