Scaling In TiKV Nodes in a Production TiDB Cluster: Operation Steps
2022-07-24 14:20:00 【Digital China cloud base】
Contents
- Preface
- Architecture and background
- Scale-in procedure
- Notes
Preface
Recently I scaled in the TiKV nodes of a cluster. I was confident going in: when I was first learning TiDB, I had practiced this kind of scale-in many times, so I assumed it would be easy. In practice, though, I ran into several points I had never noticed as a beginner.
Examples include the TiKV Tombstone state, tuning PD parameters, and deleting Tombstone TiKV nodes from PD. While writing the runbook I stumbled repeatedly and worked through it step by step by asking senior colleagues and checking the official documentation. Fortunately, the actual operation went smoothly. Here I share the TiKV scale-in steps for beginners like me and for anyone who has not done this before. If an experienced reader spots gaps or mistakes, I would be grateful for corrections!
Architecture and background
Cluster architecture:
Before scale-in: 9 TiDB servers + 3 PD servers + 14 TiKV servers
Expected after scale-in: 9 TiDB servers + 3 PD servers + 10 TiKV servers
Background:
Production resources were short. After evaluation, the cluster's data volume turned out to be modest, so a few TiKV nodes could be borrowed temporarily; once the cluster's data grows enough, TiKV will be scaled out again.
The whole scale-in took about 6 hours. When I practiced TiKV scale-in while learning TiDB, it took less than an hour. Most of the time here was spent waiting for the cluster to balance.
In the balance phase, after the scale-in command is executed for a TiKV node, the data on that node is scheduled onto the other nodes; the more data there is, the longer you wait. (In my case the cluster held about 7 TB, and balancing took about five hours.)
Next I describe the scale-in process step by step. A few caveats, points that are easy to overlook when experimenting, are collected at the end of the article.
Scale-in procedure
1. View the existing cluster nodes and their status:
su - tidb
tiup cluster display tidb-test
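If you want to check node status from a script rather than by eye, the table printed by `tiup cluster display` can be parsed. The sketch below is a hedged illustration, not a TiUP API: it assumes the column order ID, Role, Host, Ports, OS/Arch, Status, Data Dir, Deploy Dir, and that directory paths contain no spaces. `DISPLAY_SAMPLE` is hypothetical apart from the 10.3.65.141:20161 instance named in this article.

```python
# Hypothetical sample of `tiup cluster display tidb-test` output; only the
# 10.3.65.141:20161 instance comes from this article, the rest is made up.
DISPLAY_SAMPLE = """\
Cluster type:       tidb
Cluster name:       tidb-test
ID                 Role  Host         Ports        OS/Arch       Status           Data Dir               Deploy Dir
--                 ----  ----         -----        -------       ------           --------               ----------
10.3.65.140:2379   pd    10.3.65.140  2379/2380    linux/x86_64  Up|L|UI          /tidb-data/pd-2379     /tidb-deploy/pd-2379
10.3.65.141:20161  tikv  10.3.65.141  20161/20181  linux/x86_64  Pending Offline  /tidb-data/tikv-20161  /tidb-deploy/tikv-20161
10.3.65.142:20160  tikv  10.3.65.142  20160/20180  linux/x86_64  Up               /tidb-data/tikv-20160  /tidb-deploy/tikv-20160
Total nodes: 3
"""


def parse_display(text):
    """Return one dict per instance row of `tiup cluster display` output.

    Assumes the column order ID, Role, Host, Ports, OS/Arch, Status,
    Data Dir, Deploy Dir, and that paths contain no spaces.
    """
    nodes, in_table = [], False
    for line in text.splitlines():
        fields = line.split()
        if fields[:2] == ["ID", "Role"]:
            in_table = True  # header found; instance rows follow
            continue
        # Skip the preamble, the dashed separator row and short lines such as "Total nodes".
        if not in_table or len(fields) < 8 or fields[0].startswith("--"):
            continue
        nodes.append({
            "id": fields[0],
            "role": fields[1],
            "host": fields[2],
            # Status can contain a space ("Pending Offline"), so join every
            # field between OS/Arch and the two directory columns.
            "status": " ".join(fields[5:-2]),
        })
    return nodes


for node in parse_display(DISPLAY_SAMPLE):
    if node["role"] == "tikv":
        print(node["id"], node["status"])
```

Joining the middle fields rather than indexing a fixed column keeps two-word states intact; if your TiUP version prints a different layout, adjust the slice.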
2. Confirm the instance to be removed:
The node to be removed is 10.3.65.141:20161.
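Before running the scale-in, it can help to sanity-check the plan. One constraint worth verifying is that at least `max-replicas` TiKV stores (3 by default) remain, because PD never places two replicas of the same Region on one store, so the removed stores could not finish draining otherwise. The sketch below is a hedged illustration under that assumption; clusters with location labels or placement rules may need more stores per zone. Every address except 10.3.65.141:20161 is hypothetical.

```python
def check_scale_in_plan(tikv_up_ids, targets, max_replicas=3):
    """Return a list of problems with a planned TiKV scale-in (empty = looks OK).

    tikv_up_ids:  IDs (host:port) of TiKV instances currently Up.
    targets:      IDs to be passed to `tiup cluster scale-in -N`.
    max_replicas: PD's replication max-replicas setting (3 by default).
    """
    up, remove = set(tikv_up_ids), set(targets)
    problems = []
    unknown = sorted(remove - up)
    if unknown:
        problems.append("not an Up TiKV instance: " + ", ".join(unknown))
    remaining = len(up - remove)
    if remaining < max_replicas:
        # Too few stores left to hold max_replicas distinct replicas per Region.
        problems.append(f"{remaining} TiKV stores would remain, fewer than max-replicas={max_replicas}")
    return problems


# Hypothetical: 14 TiKV instances shrinking to 10, matching this article's plan.
cluster = [f"10.3.65.{i}:20161" for i in range(130, 144)]
plan = ["10.3.65.141:20161", "10.3.65.140:20161", "10.3.65.139:20161", "10.3.65.138:20161"]
print(check_scale_in_plan(cluster, plan) or "plan looks OK")
```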

3. Check key metrics
Before scaling in, log in to the Grafana monitoring dashboards and check that the key metrics are normal: cluster Region health, Region leader distribution, disk I/O, memory, CPU, and overall cluster load.



(The node being removed is one I had just added while testing a cluster scale-out, so some monitoring curves are slightly elevated; this does not affect the scale-in and can be ignored here. In a production environment, you must find the cause of any elevated metric and confirm that it has been resolved, or that it has no impact on the scale-in, before proceeding.)
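Besides Grafana, the leader and Region distribution can also be read from `pd-ctl store`, which prints JSON. The sketch below assumes the field names `store.address`, `store.state_name`, `status.leader_count` and `status.region_count`; the sample is trimmed and its numbers are hypothetical. The max/min ratio is only a rough imbalance indicator of my own choosing, not a PD metric.

```python
import json

# Trimmed, hypothetical `pd-ctl store` output; real output has many more fields.
STORE_SAMPLE = """{
  "count": 3,
  "stores": [
    {"store": {"id": 1, "address": "10.3.65.141:20161", "state_name": "Up"},
     "status": {"leader_count": 100, "region_count": 300}},
    {"store": {"id": 4, "address": "10.3.65.142:20160", "state_name": "Up"},
     "status": {"leader_count": 110, "region_count": 310}},
    {"store": {"id": 5, "address": "10.3.65.143:20160", "state_name": "Up"},
     "status": {"leader_count": 90, "region_count": 290}}
  ]
}"""


def summarize_stores(store_json):
    """Map store address -> (state_name, leader_count, region_count)."""
    data = json.loads(store_json)
    summary = {}
    for entry in data.get("stores") or []:
        store, status = entry["store"], entry.get("status", {})
        summary[store["address"]] = (
            store.get("state_name"),
            status.get("leader_count", 0),
            status.get("region_count", 0),
        )
    return summary


def spread(values):
    """Max/min ratio as a rough imbalance indicator; 1.0 means perfectly even."""
    lo, hi = min(values), max(values)
    return float("inf") if lo == 0 else hi / lo


stores = summarize_stores(STORE_SAMPLE)
leaders = [leader for _, leader, _ in stores.values()]
print(f"leader spread: {spread(leaders):.2f}")
```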
4. Tune PD parameters to speed up balancing
/tidb-data1/pd/tidb-deploy/pd-2379/bin/pd-ctl -i -u http://127.0.0.1:2379
· View the parameters (save a copy of the original settings so that you can quickly restore them if something goes wrong)
» config show
» store limit
· Modify the parameters
» config set max-pending-peer-count 256
(Caps the number of pending peers on a single store, preventing large numbers of Regions with lagging logs from building up on some nodes. Raise it to speed up replica replenishment or balancing; 0 means no limit.)
» config set replica-schedule-limit 512
(Controls how many replica scheduling tasks run concurrently. It mainly governs scheduling speed when a node goes down or offline: the higher the value, the faster the scheduling, and 0 disables this scheduling. Replica scheduling is expensive, so a large value is generally not recommended, but since this is a test cluster I set it high to speed things up.)
» store limit all 800 add-peer
(Sets the add-peer rate limit of every store to 800 per minute.)
» store limit all 20000 remove-peer
(Sets the remove-peer rate limit of every store to 20000 per minute.)
· If there is a problem, roll back to the original parameters.
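The same tuning and rollback can be scripted by calling pd-ctl in single-command (non-interactive) mode, which keeps both command lists side by side. This is a hedged sketch: `PD_URL` and the bare `pd-ctl` binary name are assumptions (in this article the binary lives under the PD deploy directory), and the rollback assumes every store shared one add-peer/remove-peer limit before tuning, as the article's own rollback does. `run` only prints by default.

```python
import json
import subprocess

PD_URL = "http://127.0.0.1:2379"  # assumption: PD reachable locally
# Tuned values from step 4 of this article.
TUNING = {"max-pending-peer-count": 256, "replica-schedule-limit": 512}
TUNED_STORE_LIMITS = {"add-peer": 800, "remove-peer": 20000}


def pd_ctl(*args, binary="pd-ctl", url=PD_URL):
    """argv for a single-command (non-interactive) pd-ctl call."""
    return [binary, "-u", url, *map(str, args)]


def tuning_commands():
    cmds = [pd_ctl("config", "set", key, value) for key, value in TUNING.items()]
    cmds += [pd_ctl("store", "limit", "all", rate, kind)
             for kind, rate in TUNED_STORE_LIMITS.items()]
    return cmds


def rollback_commands(saved_config_json, saved_store_limits):
    """Rollback argv built from the `config show` JSON saved before tuning.

    saved_store_limits: {"add-peer": rate, "remove-peer": rate}, assuming all
    stores shared one limit before tuning.
    """
    schedule = json.loads(saved_config_json)["schedule"]
    cmds = [pd_ctl("config", "set", key, schedule[key]) for key in TUNING]
    cmds += [pd_ctl("store", "limit", "all", rate, kind)
             for kind, rate in saved_store_limits.items()]
    return cmds


def run(cmds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Trimmed copy of the original `config show` output (values from this article's rollback).
saved = '{"schedule": {"max-pending-peer-count": 16, "replica-schedule-limit": 64}}'
run(tuning_commands())
run(rollback_commands(saved, {"add-peer": 15, "remove-peer": 15}))
```

Generating the rollback from the saved JSON, rather than typing remembered values, means the restore step cannot drift from what was actually recorded.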
5. Start the scale-in:
Run it inside screen. The scale-in command may run for a long time, and a session running under screen survives an accidental SSH disconnect that would otherwise kill the command:
screen -S test
tiup cluster scale-in tidb-test -N 10.3.65.141:20161
While the command runs, watch the cluster monitoring: check that leaders and Regions are migrating smoothly off the TiKV instance being removed, and watch Region health, leader and Region distribution, disk I/O, and memory usage.
Cluster monitoring metrics before the scale-in:



Cluster monitoring metrics after the scale-in command was executed:




The monitoring charts show that the cluster has started migrating replicas; the more data there is, the longer the migration takes. During migration, keep an eye on the key cluster metrics and handle any problems promptly.
6. After the scale-in, check the cluster status:
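To judge how long the migration will take, you can sample the Region count of the store being removed at two points in time and extrapolate. This is a hedged, linear estimate of my own: the drain rate varies with load and Region size, and Region count is only a proxy for data volume. The readings below are hypothetical.

```python
from datetime import datetime, timedelta


def estimate_remaining(first, last):
    """Linear ETA for an offline store to drain its Regions.

    first, last: (timestamp, region_count) samples of the store being removed,
    e.g. read from Grafana or `pd-ctl store <store_id>` at two points in time.
    Returns a timedelta, or None if the count has not decreased.
    """
    (t0, c0), (t1, c1) = first, last
    drained = c0 - c1
    elapsed = (t1 - t0).total_seconds()
    if drained <= 0 or elapsed <= 0:
        return None
    # Time left = Regions left / average drain rate (drained / elapsed).
    return timedelta(seconds=c1 * elapsed / drained)


# Hypothetical readings taken one hour apart.
eta = estimate_remaining((datetime(2022, 7, 24, 10, 0), 60000),
                         (datetime(2022, 7, 24, 11, 0), 48000))
print(eta)
```

Here 12,000 Regions drained in one hour, so the remaining 48,000 would take about four more hours at the same rate.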
tiup cluster display tidb-test


Confirm that the removed node's status is Tombstone. Also log in to Grafana, open the Overview → TiKV panel, and check the leader and Region distribution to confirm that replica migration has finished.
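Instead of re-running the display command by hand, a small polling loop can wait for the Tombstone state. In this hedged sketch, `display_status` is a hypothetical helper that shells out to `tiup cluster display` and assumes the same column layout as the default table (Status between OS/Arch and the two directory columns); `wait_for_tombstone` takes the status function and the sleep function as parameters so it can be exercised without a cluster.

```python
import subprocess
import time


def display_status(node_id, cluster="tidb-test"):
    """Hypothetical helper: Status column for node_id from `tiup cluster display`."""
    out = subprocess.run(["tiup", "cluster", "display", cluster],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0] == node_id:
            return " ".join(fields[5:-2])  # Status may be two words
    return None


def wait_for_tombstone(get_status, node_id, interval=60.0, timeout=8 * 3600,
                       sleep=time.sleep):
    """Poll until node_id reports Tombstone; return the number of polls made."""
    deadline = time.monotonic() + timeout
    polls = 0
    while True:
        polls += 1
        status = get_status(node_id)
        if status == "Tombstone":
            return polls
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{node_id} still {status!r} after {timeout}s")
        sleep(interval)


# Usage on a real cluster (not executed here):
# wait_for_tombstone(display_status, "10.3.65.141:20161")
```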


Once balancing is confirmed complete, run the cleanup command:
tiup cluster prune tidb-test
tiup cluster display tidb-test

The cluster has now been scaled in and matches the expected layout. However, you still need to delete the Tombstone store in PD; otherwise Grafana keeps showing the Tombstone TiKV.
Check whether any Tombstone stores remain:
./pd-ctl store --state Tombstone

Remove them, then confirm that none remain:
./pd-ctl store remove-tombstone
./pd-ctl store --state Tombstone
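The check before and after `remove-tombstone` can also be done on the JSON that pd-ctl prints. This is a hedged sketch that assumes the same `stores[].store.{id, address, state_name}` layout as `pd-ctl store`; the sample is hypothetical, and the code tolerates an empty, null or missing `stores` list rather than relying on one particular form.

```python
import json

# Hypothetical `pd-ctl store --state Tombstone` output, trimmed to the fields used.
TOMBSTONE_SAMPLE = """{
  "count": 1,
  "stores": [
    {"store": {"id": 1, "address": "10.3.65.141:20161", "state_name": "Tombstone"}}
  ]
}"""


def tombstone_stores(store_json):
    """Return (store_id, address) for each store whose state is Tombstone."""
    data = json.loads(store_json)
    return [
        (entry["store"]["id"], entry["store"]["address"])
        for entry in data.get("stores") or []  # guard against null or missing list
        if entry["store"].get("state_name") == "Tombstone"
    ]


leftover = tombstone_stores(TOMBSTONE_SAMPLE)
if leftover:
    print("run `pd-ctl store remove-tombstone` to clear:", leftover)
```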

Cleanup complete:
Log in to the Grafana dashboards and check that the cluster health metrics are normal and that the node count matches expectations.


All cluster components are healthy, and the node count is as expected.
Finally, restore the PD parameters to their original values.
# Restore the PD parameters to their original values
/tidb-data/pd/tidb-deploy/pd-2379/bin/pd-ctl -i -u http://127.0.0.1:2379
» config set max-pending-peer-count 16
» config set replica-schedule-limit 64
» store limit all 15 add-peer
» store limit all 15 remove-peer
# Check the parameters
» config show
» store limit
Confirm that the cluster parameters are back to their pre-scale-in values. The TiKV scale-in is now complete.
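Comparing the saved output with the current output is easier to trust than eyeballing it. The sketch below is hedged: `config_drift` assumes `config show` JSON with a `schedule` section (as pd-ctl prints), and `limit_drift` assumes `store limit` prints a JSON object keyed by store ID, such as `{"1": {"add-peer": 15, "remove-peer": 15}}`. The sample values are taken from or modeled on this article's rollback and are otherwise hypothetical.

```python
import json

KEYS = ["max-pending-peer-count", "replica-schedule-limit"]


def config_drift(saved_json, current_json, keys=KEYS):
    """{key: (saved, current)} for PD schedule keys whose values differ."""
    saved = json.loads(saved_json)["schedule"]
    current = json.loads(current_json)["schedule"]
    return {k: (saved.get(k), current.get(k)) for k in keys if saved.get(k) != current.get(k)}


def limit_drift(expected, store_limit_json):
    """Store IDs whose add-peer/remove-peer limits differ from `expected`."""
    limits = json.loads(store_limit_json)
    return sorted(sid for sid, lim in limits.items()
                  if any(lim.get(kind) != rate for kind, rate in expected.items()))


saved = '{"schedule": {"max-pending-peer-count": 16, "replica-schedule-limit": 64}}'
current = '{"schedule": {"max-pending-peer-count": 16, "replica-schedule-limit": 512}}'
limits_now = '{"1": {"add-peer": 15, "remove-peer": 15}, "4": {"add-peer": 800, "remove-peer": 15}}'
print(config_drift(saved, current))
print(limit_drift({"add-peer": 15, "remove-peer": 15}, limits_now))
```

An empty dict and an empty list mean the rollback is complete; anything else names exactly which setting or store still carries a tuned value.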
Notes
To close, here are the steps I found important, points that beginners like me easily miss when experimenting:
1. Parameters
Before starting the scale-in, if the cluster holds a lot of data, you can tune some PD parameters to speed up replica migration. Suitable values depend on your environment; see the official documentation. The values I used can serve as a reference, but always save the original values before changing anything, so that you can roll back promptly if something goes wrong.
2. Waiting for balance
You must wait for the cluster to finish migrating replicas. This step is easy to miss: an experimental cluster usually holds little data, so migration finishes almost instantly, and you never have to wait for it or even think about it.
3. Tombstone state
After the scale-in, delete the Tombstone nodes in PD; otherwise Grafana keeps showing the Tombstone TiKV. This is another step we rarely pay attention to when experimenting.
That concludes this scale-in. Along the way I added some of my own understanding; if you have anything to add, or find something I got wrong, corrections are welcome!
Copyright notice: This article was compiled and written by the Digital China Cloud Base team. Please credit the source when reposting.
Search for the Digital China Cloud Base official account and reply "Odoo" to join the Odoo technical exchange group!