
Deep thinking triggered by troubleshooting a NameNode RPC latency issue

2022-06-21 05:50:00 Meepoljd

Preface

On December 1, the HDFS clusters in several provinces started showing high NameNode RPC latency. We worked on it all night and only vaguely confirmed that frequent du or count operations were the cause. Later it turned out that the caller was a Python collection script written by the developers to gather statistics, which executed du commands. The incident was quite interesting, so I am writing it down.

Main text

Troubleshooting the problem

Initial diagnosis

I didn't keep screenshots of the whole process, so I'll include what I have. On the night of December 1, at around 20:00, I got a call asking me to help look into why tasks on the provincial cluster couldn't run. At the time I checked the computing resources and found no abnormal tasks or resource usage, yet the tasks did run, just very slowly. Based on experience I suspected something was wrong with HDFS, so I took a look at the RPC situation, and the latency was frighteningly high:
[Screenshot: NameNode RPC latency spiking]
My first reaction was to check the network; it was normal, with no packet loss. The NameNode's GC was also normal. Overall, apart from the high RPC latency, no metric looked abnormal. So I collected 30 minutes of stack samples and looked at the flame graph. It suggested that du or count operations were probably responsible, but I could not pin down a specific operation, and since the HDFS audit log was disabled, there was nowhere else to start:
[Screenshot: flame graph of the NameNode stack samples]

Temporary recovery

That night we stopped the scheduler and the computing tasks one by one and then restarted them, but the root cause was still not found. Having stayed up all night, I couldn't hold on any longer and did not keep watching; the vendor contracted as Party B continued the investigation.
Unexpectedly, the matter dragged on for another two days. During those two days, with some of the business stopped, tasks could more or less run, but the cluster as a whole was still abnormal and the RPC latency stayed high:
[Screenshot: RPC latency still high over the following two days]

Locate suspicious processes

So I discussed it with the vendor's staff and we stopped the business again to track down the process thoroughly. This time we did not stop the scheduling service but stopped all access services, including the monitoring services. When the monitoring service that collects data-volume statistics was stopped, the RPC latency changed dramatically.
[Screenshot: RPC latency changing sharply after the monitoring service was stopped]
The gap in the RPC monitoring chart is because my own monitoring was also stopped at that point. At 20:20 we restarted the suspect process and the RPC latency went up again, and after we stopped it at 20:30 the latency dropped immediately:
[Screenshot: RPC latency rising and falling as the suspect process was restarted and stopped]
After confirming with the developers, we learned that the monitoring script executes a du -h operation. In theory, though, a single du command should not have such a big impact. The temporary fix was to stop the collection script; the root cause had still not been found.

Problem analysis

I was not the one responsible for this afterwards, but one day a member of the development team that owns the collection script came to me and asked me to take a look at what was wrong with the script when I had time. They fetch the metric every 5 minutes, so in theory there should not be such a problem. I agreed; it was also a chance to learn something myself.

Problem script analysis

The main part of the script is as follows:

import socket
import time

import pandas as pd
from prometheus_client import start_http_server
from prometheus_client.core import REGISTRY

# NameNodeCollector is the developers' custom collector class (defined elsewhere
# in the script); as we later confirmed, its collection ends up running du.

def main():
    # Resolve the local host name / IP to find the matching NameNode in the CSV.
    myname = socket.getfqdn(socket.gethostname())
    myip = socket.gethostbyname(myname)
    path = "/home/redpeak/location/jiajie/"
    info = pd.read_csv(path + "qzj_ip.csv", dtype=str)
    nn = info["nn1"].loc[info["ip"] == myip].iloc[0]
    try:
        port = 9162
        # Register the custom collector, then expose it over HTTP for Prometheus.
        REGISTRY.register(NameNodeCollector("http://" + nn + ":8088/cluster?user.name=admin", 'redpeak'))
        # REGISTRY.register(ResourceManagerNodeCollector(args.url, args.cluster))

        start_http_server(port)
        while True:
            time.sleep(300)
    except KeyboardInterrupt:
        print(" Interrupted")
        exit(0)
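
The NameNodeCollector class itself was not in the snippet I was given, but since we later confirmed the script runs du -h, its collect() presumably looks roughly like the hypothetical sketch below (only the class name comes from the script; the path, the parsing, and the metric name are my assumptions):

import subprocess

from prometheus_client.core import GaugeMetricFamily

class NameNodeCollector(object):
    # Hypothetical reconstruction; the real class lives in the developers' script.
    def __init__(self, url, cluster):
        self.url = url
        self.cluster = cluster

    def collect(self):
        # Assumption: the expensive part is a du executed synchronously during collection.
        out = subprocess.check_output(
            ["hadoop", "fs", "-du", "-s", "/some/monitored/path"], text=True)
        size_bytes = float(out.split()[0])
        yield GaugeMetricFamily("hdfs_path_size_bytes",
                                "Size of the monitored HDFS path (hypothetical metric)",
                                value=size_bytes)

The important point is that the du runs inside collect(), which, as we will see, is invoked on every HTTP request from Prometheus rather than on the intended 300-second schedule.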

In essence, the problem script implements a custom Prometheus collector and then starts an HTTP service. The developers apparently intended to update the metric only every 300 seconds, but following their logic I wrote a test script:

from prometheus_client.core import GaugeMetricFamily, CounterMetricFamily, REGISTRY
from prometheus_client import start_http_server
from time import sleep
from random import random
class CustomCollector(object):
    def collect(self):
        G=GaugeMetricFamily('my_gauge', 'Help text', value=random())
        yield G
        print(G)
        c = CounterMetricFamily('my_counter_total', 'Help text', labels=['foo'])
        c.add_metric(['bar'], 1.7)
        c.add_metric(['baz'], 3.8)
        yield c
        print(c)
        # sleep to simulate the latency of a du command
        sleep(10)

REGISTRY.register(CustomCollector())
start_http_server(9163)
while True:
    sleep(300)

The script is simple: it registers a Gauge metric whose value is a new random number on every refresh. I ran the script and opened the URL in a browser to look at the metric value directly, and then noticed a serious problem: the value changes on every request. In other words, if I send 10 requests in one second, collect() runs 10 times. And what if the du command hasn't finished by then? To test this idea further, I traced the threads:
[Screenshot: only two resident threads while the exporter is idle]
When the program receives no requests, there are only two resident threads. Then I started making requests from the browser while watching the thread count:
[Screenshot: thread count climbing as browser requests arrive]
The thread count simply exploded.
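
Since the screenshots do not reproduce well here, the same thing can be observed without a debugger. The sketch below is a self-contained variant of the test script (the slow collect() and the 9163-style port are the same kind of assumptions as above): it starts the exporter and fires ten concurrent scrapes from the same process, so threading.active_count() also counts the exporter's handler threads.

import threading
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

from prometheus_client import start_http_server
from prometheus_client.core import GaugeMetricFamily, REGISTRY

class SlowCollector(object):
    def collect(self):
        time.sleep(10)  # stand-in for a slow du command
        yield GaugeMetricFamily('my_gauge', 'Help text', value=1.0)

REGISTRY.register(SlowCollector())
start_http_server(9163)

def scrape(_):
    return urlopen("http://localhost:9163/metrics").read()

print("threads while idle:", threading.active_count())
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(scrape, i) for i in range(10)]
    time.sleep(2)  # give the requests time to arrive
    # All ten requests are now each being handled in their own server thread,
    # stuck together in the 10-second sleep.
    print("threads during 10 concurrent scrapes:", threading.active_count())
    for f in futures:
        f.result()
print("threads after:", threading.active_count())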

Problem cause analysis

OK, it is now obvious that the script has a bug. In our environment Prometheus scrapes the metrics every 15 seconds. If the previous scrape hasn't finished within 15 seconds, its du is still running in the background when Prometheus starts the next scrape, which calls collect() once more. When HDFS is busy, a large backlog of du commands builds up and this problem is bound to occur. For provinces where HDFS is less busy, each request returns quickly, so the problem never appears. That is why the RPC latency only showed up in a few provinces, although the hidden danger exists nationwide.
[Figure: overlapping scrapes causing du commands to pile up]
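
To make the pile-up concrete, here is a tiny back-of-the-envelope sketch (the 15-second scrape interval is our environment's; the du durations are made-up examples). As long as a du takes longer than one scrape interval, roughly ceil(duration / interval) du commands are in flight at any moment, and since the extra load slows the NameNode down further, the duration and the backlog keep feeding each other.

import math

SCRAPE_INTERVAL = 15  # seconds, our Prometheus scrape interval

# Hypothetical du durations: a quiet province vs. increasingly busy ones.
for du_seconds in (5, 20, 60, 300):
    concurrent = max(1, math.ceil(du_seconds / SCRAPE_INTERVAL))
    print(f"du takes {du_seconds}s -> about {concurrent} du command(s) in flight at once")

# A quiet province, where du finishes within one interval, never overlaps,
# which is why only the busy provinces saw the RPC latency blow up.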

Code analysis

Now that the cause has been located, let's look at why this code leads to the problem. I wrote test code of my own that also fetches the data size of a specified HDFS path by calling the du command, but I used a regular Gauge metric:

Test code

from prometheus_client import Gauge, start_http_server
import time
import os

hdfs_exporter_capacityTotal = Gauge("hdfs_exporter_capacityTotal", "Space for HDFS configuration.", ['provincecode', 'provincename', 'type'])

start_http_server(8003)
cmd = "hadoop fs -du -h hdfs://x/x/x/x/x/x/x/x/*"
while True:
    # Run du ourselves on a fixed schedule and push the result into the Gauge.
    f = os.popen(cmd)
    shuchu = f.read().replace('\n', ' ').split(' ')
    suchu = float(shuchu[0])
    hdfs_exporter_capacityTotal.labels(provincecode='province', provincename='provincename', type='servertype').set(suchu)
    time.sleep(15)

My code refreshes the metric every 15 seconds instead of on every visit. To see why the custom collector behaves differently, we have to look at the prometheus_client source.

prometheus_client analysis

Test code

This code mostly imitates the official example; the point is to observe the function call chain each time the web page is refreshed:

import random
import time
from prometheus_client import Gauge,start_http_server
from prometheus_client.core import GaugeMetricFamily, CounterMetricFamily, REGISTRY

hdfs_exporter_capacityTotal=Gauge("hdfs_exporter_capacityTotal","Space for HDFS configuration.",['provincecode','provincename','type'])

start_http_server(8003)

def random_gauge():
    # Simulate a slow value source (a stand-in for a du command).
    time.sleep(2)
    return random.random()

class CustomCollector(object):
    def collect(self):
        G=GaugeMetricFamily('my_gauge', 'Help text', value=random_gauge())
        yield G
        print(G)
        c = CounterMetricFamily('my_counter_total', 'Help text', labels=['foo'])
        c.add_metric(['bar'], 1.7)
        c.add_metric(['baz'], 3.8)
        yield c
        print(c)

REGISTRY.register(CustomCollector())


while True:
    suchu=random.random()
    hdfs_exporter_capacityTotal.labels(provincecode='province',provincename='provincename',type='servertype').set(suchu)
    time.sleep(15)

With a plain Gauge and a custom collector registered side by side, only the custom collector's method is called on each access to port 8003. The behavior comes from how start_http_server handles requests; tracing that method, our debugging leads to process_request (this is socketserver.ThreadingMixIn.process_request from the standard library):

def process_request(self, request, client_address):
    """Start a new thread to process the request."""
    if self.block_on_close:
        vars(self).setdefault('_threads', _Threads())
    t = threading.Thread(target = self.process_request_thread,
                         args = (request, client_address))
    t.daemon = self.daemon_threads
    self._threads.append(t)
    t.start()

We found where the thread is started:
[Screenshots: debugger showing process_request starting a new handler thread for each request]
This confirms that every request starts one more thread: send many requests and many threads start. The reason my version does not have this problem is that my business logic is driven by the while True loop; no matter how many GET requests arrive, no collection method is bound to them.
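
To tie the pieces together, here is a minimal sketch of what the exporter's HTTP path effectively does (a simplification of mine, not prometheus_client's actual code): ThreadingMixIn gives every GET its own thread, and each of those threads calls generate_latest(REGISTRY), which walks every registered collector and calls its collect(). That is exactly where a synchronous du would run once per request.

import socketserver
from http.server import BaseHTTPRequestHandler, HTTPServer

from prometheus_client import CONTENT_TYPE_LATEST, generate_latest
from prometheus_client.core import REGISTRY

class SimpleMetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every request thread ends up here and triggers every collector's collect().
        payload = generate_latest(REGISTRY)
        self.send_response(200)
        self.send_header("Content-Type", CONTENT_TYPE_LATEST)
        self.end_headers()
        self.wfile.write(payload)

class DemoThreadingHTTPServer(socketserver.ThreadingMixIn, HTTPServer):
    # process_request (shown above) spawns one thread per incoming request.
    daemon_threads = True

if __name__ == "__main__":
    # Arbitrary port for this sketch.
    DemoThreadingHTTPServer(("0.0.0.0", 9164), SimpleMetricsHandler).serve_forever()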

Copyright notice: this article was written by [Meepoljd]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/172/202206210547243359.html