Reflections prompted by troubleshooting a NameNode RPC latency problem
2022-06-21 05:50:00 【Meepoljd】
Preface
On December 1, the HDFS clusters in several provinces showed high NameNode RPC latency. We worked on it all night and could only tentatively conclude that frequent du or count operations were responsible. Later it was confirmed that the source was a collection script the developers had written in Python: to gather statistics, it runs the du command. I found the fault interesting, so I am writing it down.
Main text
Troubleshooting
Initial diagnosis
I did not keep screenshots of the whole process, so I will show as many as I can. At about 20:00 on Tuesday, December 1, I got a call asking me to help check why tasks on a provincial cluster were not running. I looked at the computing resources and found no abnormal tasks or resource usage, yet tasks were barely making progress and took a very long time. From experience I suspected a problem with HDFS, so I looked at the RPC metrics, and the latency was alarmingly high.
My first reaction was to check the network. It was normal, with no packet loss. NameNode GC was normal too. Overall, apart from the high RPC latency, no metric was abnormal. So I collected 30 minutes of stack information and looked at the flame graph. It looked as if du or count operations were the cause, but I could not find the specific operation, and because the HDFS audit log was turned off, there was nowhere to start.
Temporary recovery
That night we stopped the scheduler and the computing tasks in turn and then restarted them, but still did not find the root of the problem. Having stayed up all night, I could not go on, so I stopped looking and handed the investigation over to the vendor (Party B).
I did not expect this to drag on for 2 days. During those two days some business workloads were stopped and tasks could run a little, but the cluster as a whole was still abnormal and RPC latency stayed high.
Locating the suspicious process
So I agreed with the vendor's staff to stop the business workloads again and find the process once and for all. This time we did not stop the scheduling service. Instead we stopped every service that accesses the cluster, including the monitoring services. When the monitoring service that collects data volume statistics was stopped, RPC latency changed dramatically.
The gap in the RPC monitoring graph is there because my own monitoring was also stopped at that point. At 20:20 we restarted the suspect process and RPC latency went up again. At 20:30 we stopped the suspect process and RPC latency dropped immediately.
After checking with the developers, we learned that the monitoring script runs du -h. In theory, though, a single du command should not have such a large impact. The temporary fix was to stop the collection script. The root cause had not yet been found.
Problem analysis
I was not in charge of this matter. Then one day a member of the development team that wrote the collection script came to me and asked me to look at what was wrong with the script when I had time. They fetch the metrics every 5 minutes, so normally this problem should not occur. I agreed, treating it as a learning exercise.
Analysis of the problem script
The main function of the script is as follows:
def main():
    myname = socket.getfqdn(socket.gethostname())
    myip = socket.gethostbyname(myname)
    path = "/home/redpeak/location/jiajie/"
    info = pd.read_csv(path+"qzj_ip.csv",dtype=str)
    nn = info["nn1"].loc[info["ip"]==myip].iloc[0]
    try:
        port = int(9162)
        REGISTRY.register(NameNodeCollector("http://"+nn+":8088/cluster?user.name=admin", 'redpeak'))
        # REGISTRY.register(ResourceManagerNodeCollector(args.url, args.cluster))
        start_http_server(port)
        while True:
            time.sleep(300)
    except KeyboardInterrupt:
        print(" Interrupted")
        exit(0)
In fact, the main logic of the problem script is to write a Prometheus collector (a class with a collect method) and then start an HTTP service. The developer apparently intended the metrics to be updated every 300 seconds. Following the same logic, I wrote a test script:
from prometheus_client.core import GaugeMetricFamily, CounterMetricFamily, REGISTRY
from prometheus_client import start_http_server
from time import sleep
from random import random

class CustomCollector(object):
    def collect(self):
        G=GaugeMetricFamily('my_gauge', 'Help text', value=random())
        yield G
        print(G)
        c = CounterMetricFamily('my_counter_total', 'Help text', labels=['foo'])
        c.add_metric(['bar'], 1.7)
        c.add_metric(['baz'], 3.8)
        yield c
        print(c)
        # sleep simulates the delay of the du command
        sleep(10)

REGISTRY.register(CustomCollector())
start_http_server(9163)
while True:
    sleep(300)
The script is simple: it registers a Gauge metric whose value is a new random number on each refresh. I ran the script and opened the URL in a browser to look at the metric value directly, and found a serious problem: the value changes on every request. In other words, if I send 10 requests in one second, collect runs 10 times? If what it runs is a du command, we are in trouble. To test this idea further, I traced the threads:
When the program is not receiving requests, there are only two resident threads. I then started sending requests from the browser while watching the thread count:
The thread count simply blew up.
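The per-request behaviour can also be reproduced without a browser. The following is only a minimal sketch, assuming prometheus_client is installed; CountingCollector and the metric name demo_collect_calls are made up for illustration, and generate_latest is used here to stand in for one HTTP scrape of the metrics page.

```python
from prometheus_client import CollectorRegistry, generate_latest
from prometheus_client.core import GaugeMetricFamily


class CountingCollector:
    """Hypothetical collector that counts how often collect() is invoked."""

    def __init__(self):
        self.calls = 0

    def collect(self):
        # Anything placed here (for example a du command) runs once per scrape.
        self.calls += 1
        yield GaugeMetricFamily(
            "demo_collect_calls", "Number of collect() calls so far", value=self.calls
        )


# A private registry keeps the demo separate from the global REGISTRY.
registry = CollectorRegistry()
collector = CountingCollector()
registry.register(collector)
calls_after_register = collector.calls

# Each generate_latest() call renders the metrics page once, like one scrape.
for _ in range(3):
    generate_latest(registry)

print("collect() calls caused by 3 scrapes:", collector.calls - calls_after_register)
```

The point of the sketch is only that the work inside collect is driven by the number of requests, not by the sleep interval in the main loop.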
Root cause analysis
So now it is clear that the script has a bug. In our environment Prometheus scrapes the metrics every 15 seconds. If the previous scrape cannot return the metrics within 15 seconds, its du keeps running in the background while Prometheus has already started the next scrape, which calls du once more. When HDFS is busy, a large number of du commands pile up and this problem is bound to occur. In provinces where HDFS is less busy, each request returns quickly, so it never happens. That is why the RPC latency appeared only in a few provinces, although the hidden danger exists nationwide.
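The backlog can be illustrated with a little arithmetic. This is a simplified model, not a measurement: it assumes every scrape starts exactly one du and that every du takes the same fixed time, whereas on a busy NameNode each extra du presumably slows the others down further. The function name peak_overlap and the durations 5 and 100 seconds are chosen only for illustration.

```python
def peak_overlap(scrape_interval, du_duration, scrapes):
    """Return the largest number of du commands running at the same time.

    Scrape i starts a du at i * scrape_interval seconds, and that du
    finishes du_duration seconds later.
    """
    starts = [i * scrape_interval for i in range(scrapes)]
    peak = 0
    for t in starts:
        # Count every du that has started by time t and has not finished yet.
        running = sum(1 for s in starts if s <= t < s + du_duration)
        peak = max(peak, running)
    return peak


# Quiet cluster: du finishes well within the 15-second scrape interval.
fast = peak_overlap(scrape_interval=15, du_duration=5, scrapes=20)

# Busy cluster: du needs 100 seconds, so later scrapes overlap earlier ones.
slow = peak_overlap(scrape_interval=15, du_duration=100, scrapes=20)

print("peak concurrent du on a quiet cluster:", fast)
print("peak concurrent du on a busy cluster:", slow)
```

Under these assumptions the quiet cluster never has more than one du in flight, while the busy one settles at seven concurrent du commands, which matches the observation that only some provinces were affected.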
Code analysis
With the cause located, let's look at why this code causes the problem. I wrote some test code that also gets the data size of a specified HDFS path by calling the du command, but sets a Gauge metric directly:
Test code
from prometheus_client import Gauge,start_http_server
import time
import os

hdfs_exporter_capacityTotal=Gauge("hdfs_exporter_capacityTotal","Space for HDFS configuration.",['provincecode','provincename','type'])
start_http_server(8003)
# The stray leading quote in the original command string has been removed
cmd="hadoop fs -du -h hdfs://x/x/x/x/x/x/x/x/*"
while True:
    # du runs once per loop iteration, regardless of how many scrapes arrive
    f=os.popen(cmd)
    shuchu = f.read().replace('\n',' ').split(' ')
    f.close()
    # Use the first field of the du output as the metric value
    suchu=float(shuchu[0])
    hdfs_exporter_capacityTotal.labels(provincecode='province',provincename='provincename',type='servertype').set(suchu)
    time.sleep(15)
My code refreshes the metric every 15 seconds instead of refreshing it on every visit. To see why the two approaches behave differently, we have to look at the prometheus_client source.
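The same idea, measuring on a timer and letting scrapes read a cached value, can be packaged so that at most one measurement is ever in flight. This is only a sketch using the standard library; CachedMeasurement and fake_du are hypothetical names, and fake_du stands in for the real du call. Wiring the cached value into a Gauge or a collector is left out.

```python
import threading
import time


class CachedMeasurement:
    """Run an expensive measurement in one background thread and cache the result."""

    def __init__(self, measure, interval):
        self._measure = measure
        self._interval = interval
        self._lock = threading.Lock()
        self._value = None
        self._ready = threading.Event()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def wait_first(self, timeout=None):
        """Block until the first measurement has been stored."""
        return self._ready.wait(timeout)

    def _loop(self):
        while not self._stop.is_set():
            try:
                value = self._measure()
            except Exception:
                pass  # keep serving the previous value if a measurement fails
            else:
                with self._lock:
                    self._value = value
                self._ready.set()
            # Sleep until the next refresh, waking early if stop() is called.
            self._stop.wait(self._interval)

    def get(self):
        """Cheap read for the scrape path; never triggers a measurement."""
        with self._lock:
            return self._value


calls = []


def fake_du():
    # Stand-in for the slow du command.
    calls.append(time.monotonic())
    time.sleep(0.05)
    return 42.0


cache = CachedMeasurement(fake_du, interval=60)
cache.start()
cache.wait_first(timeout=5)
readings = [cache.get() for _ in range(1000)]  # a burst of "scrapes"
cache.stop()
print("measurements run:", len(calls), "values served:", len(readings))
```

With this shape, a thousand reads cost one measurement, and a slow du only delays the next refresh instead of stacking up behind earlier ones.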
prometheus_client analysis
Test code
This code is mostly modelled on the official example code. Its main purpose is to observe the function call chain each time the web page is refreshed:
import random
import time
from prometheus_client import Gauge,start_http_server
from prometheus_client.core import GaugeMetricFamily, CounterMetricFamily, REGISTRY

hdfs_exporter_capacityTotal=Gauge("hdfs_exporter_capacityTotal","Space for HDFS configuration.",['provincecode','provincename','type'])
start_http_server(8003)

def random_gauge():
    time.sleep(2)
    return random.random()

class CustomCollector(object):
    def collect(self):
        G=GaugeMetricFamily('my_gauge', 'Help text', value=random_gauge())
        yield G
        print(G)
        c = CounterMetricFamily('my_counter_total', 'Help text', labels=['foo'])
        c.add_metric(['bar'], 1.7)
        c.add_metric(['baz'], 3.8)
        yield c
        print(c)

REGISTRY.register(CustomCollector())
while True:
    suchu=random.random()
    hdfs_exporter_capacityTotal.labels(provincecode='province',provincename='provincename',type='servertype').set(suchu)
    time.sleep(15)
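This script uses both a direct Gauge and a custom collector. On every visit to port 8003, only the custom collector's collect method is called. The behaviour appears to originate in start_http_server, so I traced that method and, following the debugging results, arrived at the process_request method (it belongs to ThreadingMixIn in Python's standard socketserver module, on which the exporter's HTTP server is built):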
def process_request(self, request, client_address):
    """Start a new thread to process the request."""
    if self.block_on_close:
        vars(self).setdefault('_threads', _Threads())
    t = threading.Thread(target = self.process_request_thread,
                         args = (request, client_address))
    t.daemon = self.daemon_threads
    self._threads.append(t)
    t.start()
This is where the thread is started.

This confirms that each request starts one more thread, so many requests start many threads. My code does not have this problem because its business logic is driven by the while True loop: no matter how many GET requests arrive, no measuring method is bound to them.
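The thread-per-request behaviour can be observed with the standard library alone. The sketch below does not use prometheus_client; it relies on ThreadingHTTPServer, which is built on the same ThreadingMixIn, and on a hypothetical SlowHandler whose 0.5-second sleep stands in for a slow du. The number of clients (5) is arbitrary.

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

lock = threading.Lock()
in_flight = 0
peak_in_flight = 0
handler_threads = set()


class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global in_flight, peak_in_flight
        with lock:
            in_flight += 1
            peak_in_flight = max(peak_in_flight, in_flight)
            handler_threads.add(threading.get_ident())
        time.sleep(0.5)  # stands in for a slow du command
        with lock:
            in_flight -= 1
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 lets the operating system pick a free port.
server = ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()


def fetch():
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=10)
    conn.request("GET", "/")
    conn.getresponse().read()
    conn.close()


# Five clients arrive at almost the same time, like overlapping scrapes.
clients = [threading.Thread(target=fetch) for _ in range(5)]
for c in clients:
    c.start()
for c in clients:
    c.join()

server.shutdown()
server.server_close()
print("peak concurrent handlers:", peak_in_flight)
print("distinct handler threads:", len(handler_threads))
```

Because each request gets its own thread, a slow handler does not block later requests; they simply run alongside it, which is exactly how the du commands accumulated.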