ClickHouse OPTIMIZE TABLE: A Comprehensive Analysis
2022-06-24 03:14:00 【2011aad】
Recently, while using ClickHouse for business development, I ran into a number of problems because I did not understand the exact behavior of the OPTIMIZE TABLE command. While investigating those problems I also found that there is very little material online about OPTIMIZE TABLE, so I decided to analyze the command comprehensively, together with the source code.
OPTIMIZE TABLE command functionality
As an OLAP database, ClickHouse has weak support for data updates and does not support the standard SQL UPDATE/DELETE syntax. The ALTER TABLE ... UPDATE/DELETE syntax it does provide is asynchronous: the server returns success to the client as soon as the command is accepted, and it is uncertain when the data will actually be updated.
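For reference, a minimal sketch of this asynchronous mutation syntax (the database, table, and column names here are made up for illustration):
-- both statements return as soon as the mutation is registered;
-- the actual rewrite of the affected parts happens asynchronously in the background
ALTER TABLE mydb.events DELETE WHERE user_id = 42;
ALTER TABLE mydb.events UPDATE status = 'done' WHERE user_id = 42;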
Therefore, when the business needs data updates (for example, syncing MySQL data into ClickHouse), the usual workaround is to implement asynchronous updates through the merge logic of ReplacingMergeTree or CollapsingMergeTree. On the one hand this guarantees eventual consistency of the data; on the other hand the performance overhead on ClickHouse is smaller than with ALTER TABLE. The drawback of this approach is that the merge process of the MergeTree engines is driven by ClickHouse's own policies and runs at unpredictable times, so there is no time bound on data consistency; in extreme cases the data may still not be fully merged after an entire day.
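A minimal sketch of this pattern with ReplacingMergeTree (the table definition, column names, and version column below are assumptions for illustration): an update is written as a new row, and the engine keeps only the row with the highest version per sorting key once the parts are merged.
CREATE TABLE mydb.events
(
    user_id    UInt64,
    event_date Date,
    status     String,
    version    UInt64
)
ENGINE = ReplacingMergeTree(version)
PARTITION BY toYYYYMMDD(event_date)
ORDER BY (user_id, event_date);

-- an "update" is just another insert with a higher version;
-- old and new rows coexist until a merge (or OPTIMIZE ... FINAL) collapses them
INSERT INTO mydb.events VALUES (42, '2021-10-13', 'done', 2);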
The OPTIMIZE TABLE command forces a MergeTree merge to run, so it can be used to work around the uncertainty of merge timing.
OPTIMIZE TABLE execution flow: source code analysis
After ClickHouse receives a SQL statement, it executes it through the following pipeline: Parser (parse the SQL into an AST) -> Interpreter (build the execution plan and apply rule-based optimization, RBO) -> Interpreter::executeImpl (read or write data through block streams) [1]. OPTIMIZE TABLE statements are no exception, except that an OPTIMIZE statement has no complex execution plan.
When ClickHouse receives an OPTIMIZE TABLE command, ParserOptimizeQuery::parseImpl() is called to parse it.
bool ParserOptimizeQuery::parseImpl(Pos & pos, ASTPtr & node, Expected & expected)
{
ParserKeyword s_optimize_table("OPTIMIZE TABLE");
ParserKeyword s_partition("PARTITION");
ParserKeyword s_final("FINAL");
ParserKeyword s_deduplicate("DEDUPLICATE");
ParserKeyword s_by("BY");
......
}
As you can see, the OPTIMIZE TABLE statement mainly parses the following keywords: “OPTIMIZE TABLE”, “PARTITION”, “FINAL”, “DEDUPLICATE”, and “BY”. The official documentation describes their roles as follows [2] (usage examples follow the list):
1. “OPTIMIZE TABLE”: specifies the table to optimize; only MergeTree-family engines are supported.
2. “PARTITION”: if a partition is specified, the merge is triggered only for that partition.
3. “FINAL”: forces the merge even when the data is already in a single part, and even when another merge is already in progress.
4. “DEDUPLICATE”: removes duplicates; without a following “BY” clause, rows are deduplicated only when they are completely identical (all column values equal).
5. “BY”: used together with “DEDUPLICATE” to specify which columns are used for deduplication.
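Putting these keywords together, typical invocations look like the following (database, table, and partition values are placeholders, reusing the illustrative table above):
-- merge only one partition
OPTIMIZE TABLE mydb.events PARTITION ID '20210209';

-- force a merge of every partition, even partitions already consisting of a single part
OPTIMIZE TABLE mydb.events FINAL;

-- merge and deduplicate; the BY list must cover all ORDER BY / PRIMARY KEY / PARTITION BY columns
OPTIMIZE TABLE mydb.events FINAL DEDUPLICATE BY user_id, event_date;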
Next, let's look at the source code to see how these keywords control how the merge is executed.
Entering InterpreterOptimizeQuery::execute(): it first checks whether the “DEDUPLICATE BY” columns include the table's sorting key (ORDER BY / PRIMARY KEY) and partition key columns; if not, an exception is thrown immediately. ClickHouse splits stored data into parts by partition key, and the data inside each part is sorted by the sorting key, so as long as those key columns take part in deduplication, ClickHouse only needs to compare adjacent rows and does not have to build a hash table, which greatly improves execution efficiency.
BlockIO InterpreterOptimizeQuery::execute()
{
......
// Empty list of names means we deduplicate by all columns, but user can explicitly state which columns to use.
Names column_names;
if (ast.deduplicate_by_columns)
{
......
metadata_snapshot->check(column_names, NamesAndTypesList{}, table_id);
Names required_columns;
{
required_columns = metadata_snapshot->getColumnsRequiredForSortingKey();
const auto partitioning_cols = metadata_snapshot->getColumnsRequiredForPartitionKey();
required_columns.reserve(required_columns.size() + partitioning_cols.size());
required_columns.insert(required_columns.end(), partitioning_cols.begin(), partitioning_cols.end());
}
for (const auto & required_col : required_columns)
{
// Deduplication is performed only for adjacent rows in a block,
// and all rows in block are in the sorting key order within a single partition,
// hence deduplication always implicitly takes sorting keys and partition keys in account.
// So we just explicitly state that limitation in order to avoid confusion.
if (std::find(column_names.begin(), column_names.end(), required_col) == column_names.end())
throw Exception(ErrorCodes::THERE_IS_NO_COLUMN,
"DEDUPLICATE BY expression must include all columns used in table's"
" ORDER BY, PRIMARY KEY, or PARTITION BY but '{}' is missing."
" Expanded DEDUPLICATE BY columns expression: ['{}']",
required_col, fmt::join(column_names, "', '"));
}
}
table->optimize(query_ptr, metadata_snapshot, ast.partition, ast.final, ast.deduplicate, column_names, getContext());
return {};
}
After the deduplication columns are verified, the table's optimize() method is called. In fact, only MergeTree and ReplicatedMergeTree implement optimize(); calling it on any other storage engine throws an exception directly.
Entering StorageMergeTree::optimize(): when no “PARTITION” is specified and “FINAL” is used, it iterates over all partitions of the table and runs the merge logic for each of them; when a partition is specified, that partition is merged regardless of the “FINAL” keyword; when neither a partition is specified nor “FINAL” is used, partition_id in the code stays empty, and the merge() method handles this case specially.
bool StorageMergeTree::optimize(
const ASTPtr & /*query*/,
const StorageMetadataPtr & /*metadata_snapshot*/,
const ASTPtr & partition,
bool final,
bool deduplicate,
const Names & deduplicate_by_columns,
ContextPtr local_context)
{
......
String disable_reason;
if (!partition && final)
{
DataPartsVector data_parts = getDataPartsVector();
std::unordered_set<String> partition_ids;
for (const DataPartPtr & part : data_parts)
partition_ids.emplace(part->info.partition_id);
for (const String & partition_id : partition_ids)
{
if (!merge(
true,
partition_id,
true,
deduplicate,
deduplicate_by_columns,
&disable_reason,
local_context->getSettingsRef().optimize_skip_merged_partitions))
{......}
}
}
else
{
String partition_id;
if (partition)
partition_id = getPartitionIDFromQuery(partition, local_context);
if (!merge(
true,
partition_id,
final,
deduplicate,
deduplicate_by_columns,
&disable_reason,
local_context->getSettingsRef().optimize_skip_merged_partitions))
{......}
}
return true;
}
The logic of StorageMergeTree::merge() is simple: select the parts to merge, then merge the selected parts.
bool StorageMergeTree::merge(
bool aggressive,
const String & partition_id,
bool final,
bool deduplicate,
const Names & deduplicate_by_columns,
String * out_disable_reason,
bool optimize_skip_merged_partitions)
{
......
{
merge_mutate_entry = selectPartsToMerge(
metadata_snapshot,
aggressive,
partition_id,
final,
out_disable_reason,
table_lock_holder,
lock,
optimize_skip_merged_partitions,
&select_decision);
}
......
return mergeSelectedParts(metadata_snapshot, deduplicate, deduplicate_by_columns, *merge_mutate_entry, table_lock_holder);
}
Entering StorageMergeTree::selectPartsToMerge(): when partition_id is empty (which only happens when no partition is specified and “FINAL” is not used), merger_mutator.selectPartsToMerge() is called to pick some parts to merge according to the merge policy; when partition_id is not empty, selectAllPartsToMergeWithinPartition() is called to merge all parts under that partition. Therefore, when no partition is specified and the “FINAL” keyword is not used, OPTIMIZE TABLE does not guarantee that the data ends up fully merged.
std::shared_ptr<StorageMergeTree::MergeMutateSelectedEntry> StorageMergeTree::selectPartsToMerge(
const StorageMetadataPtr & metadata_snapshot,
bool aggressive,
const String & partition_id,
bool final,
String * out_disable_reason,
TableLockHolder & /* table_lock_holder */,
std::unique_lock<std::mutex> & lock,
bool optimize_skip_merged_partitions,
SelectPartsDecision * select_decision_out)
{
......
if (partition_id.empty())
{
......
if (max_source_parts_size > 0)
{
select_decision = merger_mutator.selectPartsToMerge(
future_part,
aggressive,
max_source_parts_size,
can_merge,
merge_with_ttl_allowed,
out_disable_reason);
}
else if (out_disable_reason)
*out_disable_reason = "Current value of max_source_parts_size is zero";
}
else
{
while (true)
{
UInt64 disk_space = getStoragePolicy()->getMaxUnreservedFreeSpace();
select_decision = merger_mutator.selectAllPartsToMergeWithinPartition(
future_part, disk_space, can_merge, partition_id, final, metadata_snapshot, out_disable_reason, optimize_skip_merged_partitions);
auto timeout_ms = getSettings()->lock_acquire_timeout_for_background_operations.totalMilliseconds();
auto timeout = std::chrono::milliseconds(timeout_ms);
/// If final - we will wait for currently processing merges to finish and continue.
if (final
&& select_decision != SelectPartsDecision::SELECTED
&& !currently_merging_mutating_parts.empty()
&& out_disable_reason
&& out_disable_reason->empty())
{
LOG_DEBUG(log, "Waiting for currently running merges ({} parts are merging right now) to perform OPTIMIZE FINAL",
currently_merging_mutating_parts.size());
if (std::cv_status::timeout == currently_processing_in_background_condition.wait_for(lock, timeout))
{
*out_disable_reason = fmt::format("Timeout ({} ms) while waiting for already running merges before running OPTIMIZE with FINAL", timeout_ms);
break;
}
}
else
break;
}
}
......
}
In addition, when the “FINAL” keyword is used, OPTIMIZE TABLE waits for any currently running merge to finish before performing its own merge, so when a partition is specified, using the “FINAL” keyword makes the command respond more slowly.
The logic of StorageMergeTree::mergeSelectedParts() is more involved and is not covered in detail here, but the overall flow is: read all selected parts, merge their data, and write the result to disk as a new part. With a large amount of data this is therefore a very heavy operation, because whether or not there are rows that actually need merging, the full data is read and a new copy is written to disk. After OPTIMIZE runs, a new part is generated, but the old parts do not disappear immediately; they are deleted asynchronously, so right after executing OPTIMIZE you will see a short-lived increase in storage usage.
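This temporary duplication can be observed right after the OPTIMIZE finishes: the old parts are still on disk but are marked inactive. A sketch (database and table names are placeholders):
SELECT name, active, formatReadableSize(bytes_on_disk) AS size
FROM system.parts
WHERE database = 'mydb' AND table = 'events' AND active = 0;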
For partitions that do not need to be merged at all, ClickHouse 21.1 added an optimization: a new setting, optimize_skip_merged_partitions, was added to the system settings (the system.settings table). When it is enabled, selectAllPartsToMergeWithinPartition() skips partitions that contain only a single part with level > 0 (such a partition has already been merged before).
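If your ClickHouse version has this setting, it can be enabled for the session before running the command (a sketch; whether it helps depends on your partition layout):
SET optimize_skip_merged_partitions = 1;
OPTIMIZE TABLE mydb.events FINAL;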
Experimental verification
To verify the above code logic, I ran some experiments on ClickHouse 20.3 (a version without the optimize_skip_merged_partitions setting).
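The part layout discussed in the figures below can also be inspected by querying system.parts (a sketch; substitute your own database and table names):
SELECT partition, name, level, rows, active
FROM system.parts
WHERE database = 'mydb' AND table = 'events' AND active
ORDER BY partition, name;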
1. OPTIMIZE + PARTITION
Figure 2 shows the effect of executing OPTIMIZE TABLE ... PARTITION 20210209: after execution, the 2 parts in partition 20210209 are merged into a single part whose level is 3, while other partitions are left untouched. Note that the figure shows the final state: right after the command, the original 20210209_84_94_2 and 20210209_95_95_0 directories did not disappear immediately and were only deleted a few minutes later.
2. OPTIMIZE + FINAL
Figure 3 shows the effect of OPTIMIZE TABLE ... FINAL. After the command, the multiple parts of partition 20211013 are merged into one part; at the same time, partitions that had already been fully merged (such as 20210729) are rewritten as well, their level going from 5 to 7 (because OPTIMIZE FINAL was executed twice in between).
3. OPTIMIZE
Finally, let's look at a plain OPTIMIZE, shown in Figure 4. ClickHouse selects only some parts of one partition to merge according to its policy (the three parts 20211013_0_231_28, 20211013_232_410_30, and 20211013_411_432_10 are merged into 20211013_0_432_31); this does not guarantee that the data ends up fully merged.
Usage summary
When building a data warehouse on ClickHouse, because ClickHouse does not fully support data updates, there is a trade-off between data freshness and consistency. If the application requires strong consistency and the data is subject to updates, importing in real time is almost impossible; you can only run periodic offline imports, so that the data in ClickHouse is a complete snapshot as of some point in time. Offline jobs have scheduling delays, and in general the minimum period is at the hour level; minute-level freshness is hard to reach. If the application cares more about freshness, you can import in real time, but because ClickHouse's merge process is policy-driven, consistency will be worse (you will see rows that should already have been deleted).
With real-time writes plus periodic OPTIMIZE, you can balance performance against consistency by tuning the OPTIMIZE interval. When consistency requirements are high, shorten the interval; in the extreme you can even run OPTIMIZE after every write, which reduces the inconsistency window to minutes (at a significant performance cost to ClickHouse). When the data volume is large, running OPTIMIZE roughly every half hour both protects cluster performance and still bounds the window of inconsistency. In my own setup, with a ClickHouse cluster on 32-core / 64 GB machines and a single table's raw data within 1 TB, an OPTIMIZE interval of 5-10 minutes causes no pressure.
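As a sketch of such a periodic job (the scheduling itself lives outside ClickHouse, for example in cron or an orchestrator; the table and day-based partitioning are the illustrative assumptions used above), restricting each run to the partition that is actively receiving writes keeps the cost of a single OPTIMIZE low:
-- run every 5-10 minutes; the partition id is a placeholder the scheduler fills in with the current day
OPTIMIZE TABLE mydb.events PARTITION ID '20211013' FINAL;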
References
[1] ClickHouse source code reading: a detailed look at how a SQL statement is executed. https://nowjava.com/article/43828
[2] ClickHouse docs: OPTIMIZE statement. https://clickhouse.com/docs/en/sql-reference/statements/optimize/