当前位置:网站首页>Clickhouse uses distributed join of pose series

Clickhouse uses distributed join of pose series

2022-06-24 12:21:00 fastio

JOIN Operation is OLAP The scene cannot be bypassed , And widely used operation . Yes ClickHouse for , It is very necessary to deal with distributed systems JOIN Make an in-depth study of the implementation .

In introducing distributed JOIN Before , Let's see. ClickHouse stand-alone JOIN How is it realized .

1. ClickHouse stand-alone JOIN Realization

ClickHouse stand-alone JOIN Operation default HASH JOIN Algorithm , Optional MERGE JOIN Algorithm . among ,MERGE JOIN Algorithm data will overflow to disk , Poor performance compared to the former . In this paper, we focus on HASH JOIN Implementation of algorithm JOIN operation .

ClickHouse JOIN The query syntax is as follows :

SELECT <expr_list>
FROM <left_table>
[GLOBAL] [INNER|LEFT|RIGHT|FULL|CROSS] [OUTER|SEMI|ANTI|ANY|ASOF] JOIN <right_table>
(ON <expr_list>)|(USING <column_list>) ...

ClickHouse Of HASH JOIN The algorithm is simple to implement :

  • from right_table Read the full data of the table , Build in memory HASH MAP;
  • from left_table Reading data in batches , according to JOIN KEY To HASH MAP To find , If hit , Then the data is used as JOIN Output ;
1.jpg

As can be seen from this implementation , If right_table The amount of data exceeds the limit of available memory space of a single machine , be JOIN The operation cannot be completed . Usually , The two tables JOIN when , Use smaller tables as right_table.

2. ClickHouse Distributed JOIN Realization

ClickHouse It's a decentralized architecture , It's very easy to scale the cluster horizontally . When providing services in cluster mode , Distributed JOIN Queries cannot be avoided . Distributed here JOIN Usually refers to ,JOIN The... Involved in the query left_table And right_table It's a distributed table .

Usually , Distributed JOIN The implementation mechanism is nothing more than the following :

  • Broadcast JOIN
  • Shuffle Join
  • Colocate JOIN

ClickHouse The cluster does not achieve a complete sense of Shuffle JOIN, Implemented classes Broadcast JOIN, By completing data redistribution in advance , Can achieve Colocate JOIN.

ClickHouse Distributed JOIN Queries can be divided into two categories , belt GLOBAL Keywords , And without GLOBAL Keyword situation .

2.1 Ordinary JOIN Realization

2.1 Described in GLOBAL JOIN The implementation of the . Next, let's look at none GLOBAL Keywords JOIN How to achieve it :

  • a. initiator take SQL S Replace the local table with the corresponding table in the distributed table , formation S'
  • b. initiator take a. Medium S' Distributed to each node of the cluster
  • c. The cluster node executes S', And summarize the results to initiator node
  • d. initiator The node returns the result to the client

If the right table is a distributed table , Then each node in the cluster will execute distributed queries . There will be a very serious read amplification phenomenon . Suppose the cluster has N Nodes , The right table query will be executed in the cluster N*N Time .

a.png

As shown in the figure , Executive SQL by :

SELECT a_.i, a_.s, b_.t FROM a_all as a_ JOIN b_all AS b_ ON a_.i = b_.i

among ,a_all, b_all For distributed tables , The corresponding local table name is a_local, b_local. Then change SQL The timing of distributed execution is :

  • 1)initiator Received a query request SELECT a_.i, a_.s, b_.t FROM a_local AS a_ JOIN b_all as b_ ON a_.i = b_.i That is, the left distributed table is changed to the local table name . The SQL Execute in parallel within the cluster .
  • 2) initiator Execute distributed queries , This node and other nodes execute
  • 3) The cluster node received 2) in SQL after , When analyzing the right table, the distributed table , Then a distributed query is triggered :SELECT b_.i, b_.t FROM b_local AS b_ Each node of the cluster executes the task concurrently , And merge the results , Write it down as subquery.
  • 4) The cluster node is finished 3) in SQL After execution , perform SELECT a_.i, a_.s, b_.t FROM a_local AS a_ JOIN subquery as b_ ON a_.i = b_.i among subquery Express 2 Results of execution in
  • 5) The execution of each node is completed JOIN After calculation , towards initiator Node sends data

It can be seen that ,ClickHouse General distributed JOIN Query is a simple version of Shuffle JOIN The implementation of the , Or an incomplete implementation . What is incomplete is , Not according to JOIN KEY Go to Shuffle data , Instead, each node pulls all the data in the right table . In fact, there is room for optimization .

In the production environment , The impact of query amplification on query performance can not be ignored .

2.2 GLOBAL JOIN Realization

GLOBAL JOIN The calculation process is as follows :

  • a. If the right table is a subquery , be initiator Complete subquery calculation ;
  • b. initiator Send the data in the right table to other nodes in the cluster ;
  • c. The cluster node compares the data of the left table with that of the local table and the right table JOIN Calculation ;
  • d. Other nodes in the cluster send the results back to initiator node ;
  • e. initiator Summarize the results , Send to client ;

GLOBAL JOIN Can be seen as an incomplete Broadcast JOIN Realization . If JOIN The right table has a large amount of data , It will occupy a lot of network bandwidth , Results in query performance degradation .

b.jpg

As shown in the figure , Executive SQL by :

SELECT a_.i, a_.s, b_.t FROM a_all as a_ GLOBAL JOIN b_all AS b_ ON a_.i = b_.i

among ,a_all, b_all For distributed tables , The corresponding local table name is a_local, b_local. Then change SQL The timing of distributed execution is :

  • 1) initiator Received a query request SELECT b_.i, b_.t FROM b_local AS b_ That is, the left distributed table is changed to the local table name . The SQL Execute in parallel within the cluster . Summarize the results , Record as subquery.
  • 2) initiator And other nodes in the cluster
  • 3)initiator take 2) in subquery Send to other nodes in the cluster , And trigger distributed query :SELECT a_.i, a_.s, b_.t FROM a_local AS a_ JOIN subquery as b_ ON a_.i = b_.i among subquery Express 2) Results of execution in
  • 4) The execution of each node is completed JOIN After calculation , towards initiator Node sends data

It can be seen that ,GLOBAL JOIN Put the query of the right table in initiator After completion on the node , Send to other nodes through the network , Avoid double calculation of other nodes , So as to avoid query amplification .

3. Distributed JOIN Best practices

It's clear ClickHouse Distributed JOIN After the query is implemented , Let's sum up some practical experience .

  • One 、 Try to reduce JOIN The amount of data in the right table

ClickHouse according to JOIN Data in the right table , structure HASH MAP, And will SQL All the required columns in the are read into memory . If the amount of data in the right table is too large , The node memory cannot hold the post , Unable to complete calculation .

In practice, , We usually use the smaller table as the right table , And increase the filtering conditions as much as possible , Reduce entry JOIN The amount of data calculated .

  • Two 、 utilize GLOBAL JOIN Avoid performance loss caused by query amplification

If the amount of data in the right table or sub query is controllable , have access to GLOBAL JOIN To avoid reading and amplifying . It should be noted that ,GLOBAL JOIN Will trigger data propagation between nodes , Occupy part of the network traffic . If the amount of data is large , It also brings performance loss .

  • 3、 ... and 、 Data pre distribution implementation Colocate JOIN

When JOIN When the amount of table data involved is very large , Reading amplification , Or network broadcasting brings huge performance loss , We need to take another way to complete JOIN To calculate the .

according to “ identical JOIN KEY Must be the same piece ” principle , We are going to deal with JOIN Calculated table , Press JOIN KEY Partition in the cluster dimension . Will be distributed JOIN Turn to the local of the node JOIN, Greatly reduce the query amplification problem .

If you do the following :

  • Will involve JOIN Press... For your watch JOIN KEY Fragmentation
  • according to 2.2 Section description , take JOIN It is expected that the middle right table will be replaced with the corresponding local table
c.jpg

As shown in the figure , Executive SQL by :

SELECT a_.i, a_.s, b_.t FROM a_all as a_ JOIN b_local AS b_ ON a_.i = b_.i

among ,a_all, b_all For distributed tables , The corresponding local table name is a_local, b_local. Then change SQL The timing of distributed execution is :

  • 1) initiator Received a query request SELECT a_.i, a_.s, b_.t FROM a_local AS a_ JOIN b_local as b_ ON a_.i = b_.i
  • 2) initiator Launch a distributed query , The native and other nodes execute :
  • 3) The execution of each node is completed JOIN After calculation , towards initiator Node sends data

Due to the data and pre segmentation , same JOIN KEY The corresponding data must be together , Does not exist across nodes , Therefore, there is no need to do distributed query on the right table , You can also get the right results .

4. summary

This paper introduces ClickHouse JOIN Realization principle , Sum up ClickHouse JOIN Best practices :

  • Reduce JOIN The amount of data in the right table
  • Avoid performance loss caused by query amplification
  • Data pre distribution implementation Colocate JOIN;
原网站

版权声明
本文为[fastio]所创,转载请带上原文链接,感谢
https://yzsam.com/2021/06/20210603115758500i.html