Steps for scaling in TiKV nodes in a production TiDB cluster
2022-07-24 14:20:00 【Digital China cloud base】
Contents
- Preface
- Architecture and background
- The scale-in procedure
- Points to note
Preface
I recently scaled in the TiKV nodes of a cluster. I was full of confidence before I started: when I first came into contact with TiDB I had done this scale-in operation many times, so I assumed it would be easy. In practice, I still ran into many points I had not noticed as a beginner.
For example: the Tombstone state of TiKV, modifying PD parameters, and deleting Tombstone-state TiKV nodes in PD. I stumbled repeatedly while writing the operation document, consulting senior colleagues and checking the official website step by step. Fortunately, the actual operation went smoothly. I am sharing the TiKV scale-in steps here for beginners like me and for anyone who has not done a similar operation. If any experts happen to read this, I would be grateful if you could point out what I have missed!
Architecture and background
Cluster architecture:
Cluster architecture before scale-in: 9 TiDB servers + 3 PD servers + 14 TiKV servers
Expected cluster architecture after scale-in: 9 TiDB servers + 3 PD servers + 10 TiKV servers
Background:
Resources in the production environment were tight, and an evaluation showed that the cluster's data volume was not very large, so a few TiKV nodes could be scaled in and temporarily repurposed. Once the cluster's data volume grows to a certain level, TiKV will be scaled out again.
The whole scale-in took about 6 hours. When I was learning TiDB, a TiKV scale-in probably took less than an hour. The main reason it took so long this time was waiting for the cluster to balance.
In the balance step, after the TiKV scale-in command is executed, the data on those nodes is scheduled to other nodes. The more data there is, the longer the wait (when I did it, the cluster held about 7T of data, and the balance phase took about five hours).
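As a rough back-of-the-envelope check on that waiting time, the sketch below estimates how much data has to leave the removed nodes and what aggregate migration rate the reported duration would imply. It rests on assumptions the article does not state: that the 7T is the total across all TiKV stores, that data is spread evenly over the 14 nodes, and that all 4 nodes were drained within the five-hour balance phase. The function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Rough estimate only. Assumes data is spread evenly across TiKV nodes.
# estimate_scale_in TOTAL_GIB NODES_BEFORE NODES_REMOVED HOURS
# Prints the data to migrate (GiB) and the implied aggregate rate (MiB/s).
estimate_scale_in() {
  awk -v total="$1" -v nodes="$2" -v removed="$3" -v hours="$4" 'BEGIN {
    moved = total * removed / nodes            # GiB held by the removed nodes
    rate  = moved * 1024 / (hours * 3600)      # MiB per second
    printf "%.0f GiB to migrate, about %.1f MiB/s\n", moved, rate
  }'
}

# Example with the figures from this article (7T taken as 7168 GiB):
# estimate_scale_in 7168 14 4 5
```

Plugging in your own data volume and node counts gives a first guess at how long to reserve for the balance phase; real clusters are rarely perfectly even, so treat the result as an order of magnitude.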
Next, I will walk through the scale-in procedure. A few caveats, including points that are easy to overlook when experimenting, are at the end of the article.
The scale-in procedure
1、 View the existing cluster nodes and their status:
su - tidb
tiup cluster display tidb-test
2、 Confirm the node instance to be scaled in:
Confirm that the node to be scaled in is 10.3.65.141:20161
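To double-check the target before going further, the display output can be filtered down to the TiKV rows. A minimal sketch, assuming the usual `tiup cluster display` table layout in which the first column is the instance ID and the second is the role; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print the IDs (host:port) of all TiKV instances found in
# `tiup cluster display` output read from stdin.
# Assumption: column 1 is the instance ID and column 2 is the role.
list_tikv_ids() {
  awk '$2 == "tikv" { print $1 }'
}

# Succeed only if the given instance ID is one of the TiKV instances.
is_tikv_instance() {
  list_tikv_ids | grep -qxF -- "$1"
}

# Usage:
# tiup cluster display tidb-test | list_tikv_ids
# tiup cluster display tidb-test | is_tikv_instance 10.3.65.141:20161 && echo "target found"
```

Matching the whole ID exactly (`-x`, `-F`) avoids confusing two instances on the same host that differ only in port.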

3、 Check key indicators
Before scaling in, log in to the Grafana monitoring interface and check whether the key indicators are normal: cluster region health, region leader distribution, disk IO, memory, CPU, and cluster load.



(The nodes to be scaled in are ones I had just added while testing cluster scale-out, so the monitoring curves are slightly elevated. This does not affect the scale-in and can be ignored here. In a production environment, you need to find out why each indicator has risen, and confirm that the cause has been handled and the indicator is back to normal, or that it has no impact on the scale-in, before operating.)
4、 Modify PD parameters to speed up balancing
/tidb-data1/pd/tidb-deploy/pd-2379/bin/pd-ctl -i -u http://127.0.0.1:2379
· View the parameters (keep a copy of the original cluster parameter settings, so that the original values can be restored quickly if a change causes problems)
» config show
» store limit
· Modify the parameters
» config set max-pending-peer-count 256
(Controls the upper limit of pending peers on a single store, preventing a large number of Regions with lagging logs from building up on some nodes. Increase it moderately to speed up replica repair or balancing. 0 means no limit.)
» config set replica-schedule-limit 512
(Controls the number of replica scheduling tasks performed at the same time. This setting mainly affects scheduling speed when a node goes down or offline: the higher the value, the faster the scheduling, and 0 turns scheduling off. Replica scheduling is expensive, so a very large value is generally not recommended, but this is a test cluster, so the value is set higher to speed things up.)
» store limit all 800 add-peer
(Sets the upper limit on the speed at which every store adds peers to 800 per minute.)
» store limit all 20000 remove-peer
(Sets the upper limit on the speed at which every store removes peers to 20000 per minute.)
· If there is a problem, roll back to the original parameters
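The "keep a copy, roll back if needed" advice can be scripted with pd-ctl's non-interactive mode, where the same commands are passed as arguments instead of being typed at the » prompt. A minimal sketch: the function name, the backup directory, and the `PD_CTL` / `PD_ADDR` variables are assumptions for illustration; point `PD_CTL` at the pd-ctl binary of your own deployment.

```shell
#!/usr/bin/env bash
# Save the current PD scheduling config and store limits to files so the
# original values can be looked up and restored later.
# PD_CTL and PD_ADDR are illustrative variables, not part of pd-ctl itself.
PD_CTL="${PD_CTL:-pd-ctl}"
PD_ADDR="${PD_ADDR:-http://127.0.0.1:2379}"

backup_pd_settings() {
  local dir="$1"
  mkdir -p "$dir" || return 1
  "$PD_CTL" -u "$PD_ADDR" config show > "$dir/config_show.json" || return 1
  "$PD_CTL" -u "$PD_ADDR" store limit > "$dir/store_limit.json" || return 1
  echo "PD settings saved to $dir"
}

# Usage:
# backup_pd_settings "$HOME/pd-backup-before-scale-in"
```

Saving to files rather than scrolling back through a terminal session means the original values are still at hand hours later, when the balance phase has finished.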
5、 Start the scale-in:
Run the command inside screen: the scale-in command may run for a long time, and this prevents an accidental disconnection from causing the command to fail:
screen -S test
tiup cluster scale-in tidb-test -N 10.3.65.141:20161
While the command runs, watch the cluster monitoring: check whether leaders and regions migrate smoothly off the TiKV instance being scaled in, and watch region health, leader and region distribution, disk IO usage, and memory usage.
Monitoring indicators of the cluster before the scale-in:



Monitoring indicators of the cluster after the scale-in command is executed:




As the monitoring charts show, the cluster has started migrating replicas. The more data there is, the longer replica migration takes. During the migration, keep an eye on the cluster's key indicators and handle problems promptly.
6、 After the node has been scaled in, check the cluster status:
tiup cluster display tidb-test


Confirm that the status of the scaled-in node is Tombstone. Also log in to the Grafana monitoring interface, go to the overview -> TiKV panel, check the leader and region distribution, and confirm that replica migration scheduling is complete.
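Waiting for the Tombstone status can also be automated rather than re-running display by hand. A minimal sketch, assuming the status text appears on the same output row as the instance ID; the function names, the `TIUP` variable, and the 60-second default interval are illustrative choices, not tiup features.

```shell
#!/usr/bin/env bash
# Succeed if `tiup cluster display` output on stdin shows the given
# instance ID on a row that contains the word "Tombstone".
is_tombstone() {
  awk -v id="$1" '$1 == id && /Tombstone/ { found = 1 } END { exit !found }'
}

# Poll the cluster until the instance reaches Tombstone.
# wait_for_tombstone CLUSTER INSTANCE_ID [INTERVAL_SECONDS]
wait_for_tombstone() {
  local cluster="$1" id="$2" interval="${3:-60}"
  until "${TIUP:-tiup}" cluster display "$cluster" | is_tombstone "$id"; do
    echo "$(date '+%F %T') $id not Tombstone yet, checking again in ${interval}s"
    sleep "$interval"
  done
  echo "$id is Tombstone"
}

# Usage:
# wait_for_tombstone tidb-test 10.3.65.141:20161
```

The loop only reports the status; the Grafana check of leader and region distribution is still worth doing before running any cleanup.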


Once the balance is confirmed complete, run the cleanup command:
tiup cluster prune tidb-test
tiup cluster display tidb-test

At this point the cluster has been scaled in and matches the expected architecture. However, you still need to go into PD and delete the Tombstone stores; otherwise Grafana monitoring will keep showing the Tombstone TiKV stores.
Check whether there are any Tombstone stores:
./pd-ctl store --state Tombstone

./pd-ctl store remove-tombstone
./pd-ctl store --state Tombstone
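To confirm the cleanup without reading the JSON by eye, the store count can be pulled out of the pd-ctl output. A minimal sketch, assuming the output is JSON with a top-level `"count"` field that appears before the store list (this shape is an assumption; verify it against your pd-ctl version). The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Read `pd-ctl store ...` JSON from stdin and print the value of the first
# line that carries a "count" field. Nested fields such as "region_count"
# are not matched because the pattern requires the opening quote.
store_count() {
  sed -n 's/.*"count":[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage:
# ./pd-ctl store --state Tombstone | store_count
```

If the assumption holds, the second check should print 0 once `remove-tombstone` has taken effect.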

Cleanup complete:
Log in to the Grafana dashboard and check whether the cluster's health indicators are normal and whether the number of nodes is as expected.


The cluster component status is normal, and the number of nodes is as expected.
Finally, set the PD parameters back to the cluster's original values.
# Set the PD parameters back to the original values
/tidb-data/pd/tidb-deploy/pd-2379/bin/pd-ctl -i -u http://127.0.0.1:2379
» config set max-pending-peer-count 16
» config set replica-schedule-limit 64
» store limit all 15 add-peer
» store limit all 15 remove-peer
# Check the parameters
» config show
» store limit
Confirm that the cluster parameters are back to their values from before the scale-in. With that, the TiKV scale-in is complete.
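One way to make this final confirmation less error-prone is to compare the settings after the rollback against a copy saved before the change. A minimal sketch, assuming the earlier `config show` output was saved to a file; the file names and the function name are hypothetical.

```shell
#!/usr/bin/env bash
# Compare two saved copies of pd-ctl output (for example `config show`
# taken before the change and again after the rollback).
# Prints a unified diff and returns non-zero if they differ.
settings_match() {
  local before="$1" after="$2"
  if diff -u "$before" "$after"; then
    echo "settings match"
  else
    echo "settings differ, see diff above" >&2
    return 1
  fi
}

# Usage (file names are examples):
# ./pd-ctl -u http://127.0.0.1:2379 config show > config_after.json
# settings_match config_before.json config_after.json
```

A diff catches a parameter that was changed but forgotten in the rollback list, which a visual scan of `config show` can easily miss.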
Points to note
To close, here is a summary of the steps I found important, or that a beginner like me can easily miss when experimenting:
1、 About parameters
Before the scale-in begins, if the cluster holds a lot of data, you can adjust some PD parameters to speed up replica migration. The right values differ from cluster to cluster; look them up in the official documentation. The values I used can serve as a reference, but before adjusting anything you must keep a copy of the original parameter values so that you can roll back promptly if an adjustment causes problems.
2、 About the cluster
You need to wait for the cluster to finish migrating replicas. This step is easy to miss: the clusters we experiment on may hold little data, so the step takes very little time and involves no deliberate waiting, and you may not even think of it while doing the operation.
3、 About state
After the scale-in, you need to delete the nodes in the Tombstone state in PD; otherwise Grafana monitoring will keep showing the Tombstone TiKV stores. This is another step we do not pay special attention to when experimenting.
That completes this scale-in operation. I have added some of my own understanding along the way. If you have anything to add, or find that I have written something wrong, you are welcome to say so ~
Copyright notice: This article is organized and written by the team of Digital China cloud base. If reproduced, please indicate the source.