
Locating and Troubleshooting Memory Leaks: an Analysis of Heap Profiling Principles

2022-06-24 01:25:00 PingCAP

After a system has been running for a long time, available memory may keep shrinking until some services start to fail. This is a typical memory leak. Such problems are usually hard to predict and hard to locate by statically reading the code. Heap profiling exists to help us solve exactly this kind of problem.

TiKV, as one component of a distributed system, already has preliminary heap profiling capability. This article introduces the implementation principles and usage of several common heap profilers, to help readers understand more easily how this is implemented in TiKV, and to make it easier to apply this kind of analysis to your own projects.

What is Heap Profiling?

Runtime memory leaks are quite difficult to troubleshoot in many scenarios, because such problems are usually hard to predict and hard to locate by statically reading the code.

Heap profiling exists to help us solve such problems.

Heap profiling usually means collecting or sampling an application's heap allocations in order to report on the program's memory usage, so that we can analyze the cause of memory consumption or locate the root cause of a memory leak.

How Heap Profiling Works

As a contrast, let's first take a quick look at how CPU profiling works.

When we are ready to do CPU profiling, we usually select a time window. Within that window, the CPU profiler registers a hook that the target program executes periodically (there are many ways to do this, for example the SIGPROF signal). Every time the hook runs, it grabs the current stack trace of the business thread.

The hook's execution frequency is pinned to a specific value, for example 100 Hz, so that a call stack sample of the business code is collected every 10 ms. When the time window ends, we aggregate all the collected samples: the number of times each function was captured, divided by the total sample count, gives each function's relative proportion.

With this model we can find the functions with high proportions and then locate CPU hotspots.
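To make the time-window model concrete, here is a minimal runnable sketch using Go's runtime/pprof; the busyWork function is our stand-in for real business logic, not part of the original text:

package main

import (
	"log"
	"os"
	"runtime/pprof"
	"time"
)

// busyWork stands in for business logic so the profiler has something to sample.
func busyWork(done <-chan struct{}) {
	x := 0
	for {
		select {
		case <-done:
			return
		default:
			x++
		}
	}
}

func main() {
	f, err := os.Create("cpu.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Open the time window: from here on, the runtime samples the
	// stack traces of running goroutines at roughly 100 Hz.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	done := make(chan struct{})
	go busyWork(done)
	time.Sleep(10 * time.Second)
	close(done)
	// Close the window and flush the aggregated samples to cpu.out.
	pprof.StopCPUProfile()
}

Running go tool pprof cpu.out afterwards reports, for each function, its sample count relative to the total.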

In terms of data structure, heap profiling is very similar to CPU profiling: both are a stack trace + statistics model. If you have used the pprof provided by Go, you will notice that their display formats are almost the same:

Go CPU Profile

Go Heap Profile

Unlike CPU profiling, heap profiling data collection is not done simply with a timer; it has to intrude into the memory allocation path so that it can capture the amount of memory allocated. So the usual approach of a heap profiler is to integrate itself directly into the memory allocator and obtain the current stack trace whenever the application allocates memory; finally, all samples are aggregated together, and we know the amount of memory each function allocated, directly or indirectly.

The stack trace + statistics data model of a heap profile is consistent with that of a CPU profile.

Next we will introduce the usage and implementation principles of several heap profilers.

Note: the usage scenarios of tools such as GNU gprof and Valgrind do not match our target scenario, so this article will not cover them. For the reasons, see "gprof, Valgrind and gperftools - an evaluation of some tools for application level CPU profiling on Linux" by Gernot Klingler.

Heap Profiling in Go

Most readers are probably most familiar with Go, so we use Go as the starting point and baseline for our investigation.

Note: a concept explained in an earlier section will not be repeated in later sections, even across different projects. In addition, for completeness, each project comes with a Usage section to explain how to use it; readers already familiar with a tool can skip that part.

Usage

Go runtime has a convenient built-in profiler, and heap is one of its profile types. We can open a debug port as follows:

import _ "net/http/pprof"

go func() {
   log.Print(http.ListenAndServe("0.0.0.0:9999", nil))
}()
Then, while the program is running, use the command line to grab a snapshot of the current heap profile:

$ go tool pprof http://127.0.0.1:9999/debug/pprof/heap
Alternatively, you can take a heap profile snapshot directly at a specific place in the application code:

import "runtime/pprof"

pprof.WriteHeapProfile(writer)
Here is a complete demo that strings together the usage of heap pprof:

package main

import (
 "log"
 "net/http"
 _ "net/http/pprof"
 "time"
)

func main() {
 go func() {
  log.Fatal(http.ListenAndServe(":9999", nil))
 }()

 var data [][]byte
 for {
  data = func1(data)
  time.Sleep(1 * time.Second)
 }
}

func func1(data [][]byte) [][]byte {
 data = func2(data)
 return append(data, make([]byte, 1024*1024)) // alloc 1mb
}

func func2(data [][]byte) [][]byte {
 return append(data, make([]byte, 1024*1024)) // alloc 1mb
}
This code continuously allocates memory in func1 and func2, allocating a total of 2 MB of heap memory per second.
After the program has run for a while, execute the following command to grab a profile snapshot and open a web service to browse it:

$ go tool pprof -http=":9998" localhost:9999/debug/pprof/heap
Go Heap Graph

From the graph we can intuitively see which functions account for most of the memory allocation (the bigger the box, the more allocation), and also intuitively see the call relationships between functions (the edges). For example, in the figure above it is obvious that allocations in func1 and func2 account for the majority, and that func2 is called by func1.
Note that since heap profiling is also sampled (by default, one sample per 512 KB of allocation on average), the memory sizes shown here are smaller than the memory sizes actually allocated. Like CPU profiling, these values are only meant for calculating relative proportions and then locating memory allocation hotspots.
Note: in fact, the Go runtime has logic to estimate the original size from the sampled results, but the estimate is not necessarily accurate.
Also, the 48.88% of 90.24% inside func1's box means Flat% of Cum%.
What are Flat% and Cum%? Let's change the way we browse: in the View drop-down menu at the top left, click Top:
Go Heap Top

- Name: the function name
- Flat: the amount of memory allocated by the function itself
- Flat%: Flat as a proportion of the total allocation size
- Cum: the amount of memory allocated by the function plus all the sub-functions it calls
- Cum%: Cum as a proportion of the total allocation size
- Sum%: the top-down accumulation of Flat% (from which you can intuitively judge how much memory everything from a given line up allocates)

The two views above help us locate specific functions, and Go also provides allocation statistics at line granularity. In the View drop-down at the top left, click Source:

Go Heap Source

In CPU profiling we often use the flame graph to look for wide tops and thus locate hotspot functions quickly and intuitively. Of course, since the data models are homogeneous, heap profiling data can also be displayed as a flame graph. In the View drop-down at the top left, click Flame Graph:

Go Heap Flamegraph

Through the methods above we can easily see that the main memory allocation happens in func1 and func2. In real-world scenarios, however, locating the root cause of a problem is never this simple. What we got is a snapshot of a single moment, which is not enough for a memory leak; what we need is incremental data, to judge which memory keeps growing. So we can grab another heap profile after some interval and diff the two results.
Implementation details

In this section we focus on the implementation principles of Go heap profiling.
Recall the section "How Heap Profiling Works": the usual approach of a heap profiler is to integrate itself directly into the memory allocator and obtain the current stack trace when the application allocates memory, and that is exactly what Go does.
Go's memory allocation entry point is the mallocgc() function in src/runtime/malloc.go. A key piece of its code is as follows:

func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
 // ...
 if rate := MemProfileRate; rate > 0 {
  // Note cache c only valid while m acquired; see #47302
  if rate != 1 && size < c.nextSample {
   c.nextSample -= size
  } else {
   profilealloc(mp, x, size)
  }
 }
 // ...
}

func profilealloc(mp *m, x unsafe.Pointer, size uintptr) {
 c := getMCache()
 if c == nil {
  throw("profilealloc called without a P or outside bootstrapping")
 }
 c.nextSample = nextSample()
 mProf_Malloc(x, size)
}
This means that, on average, for every 512 KB of heap memory allocated through mallocgc(), profilealloc() is called once to record a stack trace.

Why define a sampling granularity at all? Wouldn't recording the current stack trace on every mallocgc() call be more accurate?
Completely and accurately capturing the memory allocation of every function sounds attractive, but the performance overhead is huge. malloc(), as a user-space library function, is called very frequently by applications, and optimizing allocation performance is the allocator's responsibility. If every malloc() call came with a stack backtrace, the cost would be almost unacceptable, especially for long-running server-side profiling. Choosing sampling is not about better results; it is simply a compromise.
Of course, we can also modify the MemProfileRate variable ourselves: setting it to 1 makes every mallocgc() call record a stack trace, and setting it to 0 turns off heap profiling entirely. Users can weigh performance against accuracy according to their actual scenario.
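A minimal sketch of that trade-off: MemProfileRate only affects allocations that happen after it is set, so it should be changed as early as possible, for example in an init function:

package main

import "runtime"

func init() {
	// Record every single allocation: maximum accuracy, heavy overhead.
	runtime.MemProfileRate = 1

	// Alternatively, turn heap profiling off completely:
	// runtime.MemProfileRate = 0
}

func main() {
	// ... application code whose allocations are now fully recorded ...
}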
Note that when we set MemProfileRate to a normal sampling granularity, this value is not strictly exact; instead, each sampling step is a random value drawn from an exponential distribution whose mean is MemProfileRate.

// nextSample returns the next sampling point for heap profiling. The goal is
// to sample allocations on average every MemProfileRate bytes, but with a
// completely random distribution over the allocation timeline; this
// corresponds to a Poisson process with parameter MemProfileRate. In Poisson
// processes, the distance between two samples follows the exponential
// distribution (exp(MemProfileRate)), so the best return value is a random
// number taken from an exponential distribution whose mean is MemProfileRate.
func nextSample() uintptr
Because memory allocation often follows regular patterns, sampling at a fixed granularity could produce results with large errors; it could happen that every sample coincides with one particular type of allocation. That is why randomization is chosen here.
And it is not just heap profiling: any sampling-based profiler always has some error (example: SafePoint Bias). When reading sampling-based profiling results, you need to remind yourself not to ignore the possibility of error.
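The effect of the randomization is easy to visualize. The sketch below is an illustration of the strategy, not the runtime's actual code; it draws successive sampling steps from an exponential distribution with a 512 KB mean, just as the nextSample() comment describes:

package main

import (
	"fmt"
	"math/rand"
)

const mean = 512 * 1024 // average sampling step in bytes

// nextSampleStep mimics nextSample(): the distance to the next sample
// point is exponentially distributed, so a periodic allocation pattern
// cannot systematically land between (or exactly on) the sample points.
func nextSampleStep() int {
	return int(rand.ExpFloat64() * mean)
}

func main() {
	for i := 0; i < 5; i++ {
		fmt.Printf("next sample after %d bytes\n", nextSampleStep())
	}
}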
The mProf_Malloc() function in src/runtime/mprof.go is responsible for the actual sampling:

// Called by malloc to record a profiled block.
func mProf_Malloc(p unsafe.Pointer, size uintptr) {
 var stk [maxStack]uintptr
 nstk := callers(4, stk[:])
 lock(&proflock)
 b := stkbucket(memProfile, size, stk[:nstk], true)
 c := mProf.cycle
 mp := b.mp()
 mpc := &mp.future[(c+2)%uint32(len(mp.future))]
 mpc.allocs++
 mpc.alloc_bytes += size
 unlock(&proflock)

 // Setprofilebucket locks a bunch of other mutexes, so we call it outside of proflock.
 // This reduces potential contention and chances of deadlocks.
 // Since the object must be alive during call to mProf_Malloc,
 // it's fine to do this non-atomically.
 systemstack(func() {
  setprofilebucket(p, b)
 })
}

func callers(skip int, pcbuf []uintptr) int {
 sp := getcallersp()
 pc := getcallerpc()
 gp := getg()
 var n int
 systemstack(func() {
  n = gentraceback(pc, sp, 0, gp, skip, &pcbuf[0], len(pcbuf), nil, nil, 0)
 })
 return n
}


By calling callers(), and further gentraceback(), we obtain the current call stack as the stk array (i.e. an array of PC addresses). This technique is called call stack backtracking (stack unwinding) and is used in many scenarios (such as unwinding the stack when a program panics).
Note: the term PC refers to the Program Counter, which on x86-64 is the RIP register; FP refers to the Frame Pointer, on x86-64 the RBP register; SP refers to the Stack Pointer, on x86-64 the RSP register.
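The runtime-internal callers() shown above has a public counterpart, so the same two-phase process (grab raw PCs first, symbolize later) can be tried in user code. A minimal sketch:

package main

import (
	"fmt"
	"runtime"
)

//go:noinline
func captureStack() []uintptr {
	pcs := make([]uintptr, 32)
	// Skip the frames of runtime.Callers and captureStack itself.
	n := runtime.Callers(2, pcs)
	return pcs[:n]
}

func main() {
	pcs := captureStack() // phase 1: raw PC addresses only
	// Phase 2: symbolization, resolving each PC (including inlined
	// frames) to a function name, file, and line number.
	frames := runtime.CallersFrames(pcs)
	for {
		frame, more := frames.Next()
		fmt.Printf("%s\n\t%s:%d\n", frame.Function, frame.File, frame.Line)
		if !more {
			break
		}
	}
}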
One primitive implementation of call stack backtracking requires that, in the calling convention, the RBP register (on x86-64) is guaranteed to hold the stack frame base address when a function call happens, instead of being used as a general-purpose register. Since the call instruction first pushes RIP (the return address) onto the stack, we only need to guarantee that the first thing each function pushes is the current RBP; then the frame base addresses of all functions are strung into a linked list headed by RBP, and reading the adjacent slot next to each saved RBP yields the array of RIPs.

Go FramePointer Backtrace (picture from go-profiler-notes)

Note: the picture says that all Go arguments are passed on the stack; this conclusion is now outdated, as Go supports register-based argument passing since version 1.17.
Since x86-64 treats RBP as a general-purpose register, compilers such as GCC no longer use RBP to save the stack base address by default, unless a specific option turns that on. The Go compiler, however, keeps this feature, so stack backtracking via RBP is feasible in Go.
But Go did not adopt this simple scheme, because it causes problems in some special scenarios. For example, if a function is inlined, the call stack obtained by RBP backtracking is missing that frame. This scheme also needs extra instructions inserted around regular function calls and occupies an extra general-purpose register, which has some performance cost even when we don't need stack backtracking at all.
Every Go binary contains a section named gopclntab, which is short for Go Program Counter Line Table. It maintains the mapping from PC to SP and to its return address, so we can complete the PC linked-list concatenation by direct table lookup, without relying on FP. Meanwhile, gopclntab records whether the function containing a given PC has been inline-optimized, so we do not lose inlined function frames during backtracking. Besides that, gopclntab maintains a symbol table, keeping the code information corresponding to each PC (function name, line number, etc.), so that we eventually see a human-readable panic output or profiling result instead of a pile of addresses.

gopclntab

Unlike the Go-specific gopclntab, DWARF is a standardized debugging format. The Go compiler also adds DWARF (v4) information to the binaries it produces, so external tools outside the Go ecosystem can rely on it to debug Go programs. It is worth mentioning that the information contained in DWARF is a superset of gopclntab.
Back to heap profiling. After we obtain the PC array via the stack backtracking technique (the gentraceback() function in the code above), there is no need to rush to symbolize it. Symbolization is quite expensive; we can aggregate by the raw address stack first. Aggregation here means accumulating samples with the same key in a hashmap, where two samples are "the same" if the contents of their PC arrays are exactly equal.
The stkbucket() function fetches the corresponding bucket with stk as the key, and the statistics fields in it are then accumulated.
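A simplified user-level analogue of this aggregation (a sketch, not the runtime's real data structures): the key is the raw PC array itself, so no symbolization happens on the hot path:

package main

import (
	"fmt"
	"runtime"
)

// allocStats is a stripped-down analogue of the bucket payload.
type allocStats struct {
	allocs     int64
	allocBytes int64
}

// buckets aggregates samples by call stack. Two samples share a bucket
// only when their PC arrays are exactly equal (array types are
// comparable in Go, so they can serve directly as map keys).
var buckets = map[[32]uintptr]*allocStats{}

func recordAlloc(size int64) {
	var key [32]uintptr
	runtime.Callers(2, key[:]) // raw PCs only, no symbolization
	b := buckets[key]
	if b == nil {
		b = &allocStats{}
		buckets[key] = b
	}
	b.allocs++
	b.allocBytes += size
}

func main() {
	for i := 0; i < 3; i++ {
		recordAlloc(1024)
	}
	fmt.Println("distinct stacks:", len(buckets))
}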
In addition, notice that memRecord holds several groups of memRecordCycle statistics:

type memRecord struct {
 active memRecordCycle
 future [3]memRecordCycle
}
When accumulating, the mProf.cycle global variable, modulo the array length, is used as the subscript to access one particular group of memRecordCycle. mProf.cycle is incremented after every round of GC, which records the allocations across three GC rounds. Only after a round of GC ends are the memory allocations and frees between the previous round of GC and this one merged into the finally displayed statistics. This design avoids getting a heap profile before GC has run and being shown lots of useless temporary memory.
Also, at different moments within a single GC cycle we may see an unstable heap memory state.
Finally, setprofilebucket() is called to record the bucket on the mspan related to the allocated address, and mProf_Free() is later called during GC to record the corresponding free.
In this way, the Go runtime always maintains this bucket collection. When we need to do heap profiling (for example, when calling pprof.WriteHeapProfile()), the bucket collection is visited and converted into the format required by pprof for output.
This is also a difference between heap profiling and CPU profiling: CPU profiling imposes sampling overhead on the application only within the profiling time window, while heap profiling sampling happens constantly; performing one profiling run just dumps a snapshot of the data accumulated so far.
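That is why taking the snapshot itself is cheap; a minimal sketch using the runtime's named heap profile:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("heap.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	// Dump the bucket set accumulated since process start.
	// debug=0 writes the binary format understood by `go tool pprof`.
	if err := pprof.Lookup("heap").WriteTo(f, 0); err != nil {
		log.Fatal(err)
	}
}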
Next we enter the world of C/C++/Rust. Fortunately, since most heap profilers share similar implementation principles, much of the knowledge above carries over. Most typically, Go heap profiling was in fact ported from Google tcmalloc, and they have similar implementations.
Heap Profiling with gperftools

gperftools (Google Performance Tools) is a toolkit that includes a Heap Profiler, Heap Checker, CPU Profiler, and other tools. The reason it comes right after Go is that it shares deep origins with Go.
The Google tcmalloc mentioned earlier, which the Go runtime ported, has diverged into two community versions: one is tcmalloc, a pure malloc implementation with no extra features; the other is gperftools, a malloc implementation with heap profiling capability, plus other supporting tools.
Among these, pprof is one of the best-known tools. pprof started as a perl script and later evolved into the powerful Go-based pprof tool, which has since been integrated into the Go distribution; the go tool pprof command we usually run uses the pprof package directly.
Note: the main author of gperftools is Sanjay Ghemawat, the engineering legend who pair-programmed with Jeff Dean.
Usage

Google internally uses the gperftools heap profiler to analyze the heap memory allocation of C++ programs. It can do:
- Figuring out what is in the program heap at any given time
- Locating memory leaks
- Finding places that do a lot of allocation
As the ancestor of Go pprof, its capabilities look the same as the heap profiling provided by Go.
Go hard-codes the collection code directly into the memory allocation function in the runtime. Similarly, gperftools embeds the collection code into the malloc implementation of the libtcmalloc library it provides. The user needs to link the library with -ltcmalloc in the project's compile/link phase to replace libc's default malloc implementation.
Of course, we can also rely on Linux's dynamic linking mechanism to replace it at run time:

$ env LD_PRELOAD="/usr/lib/libtcmalloc.so" <binary>


When LD_PRELOAD specifies libtcmalloc.so, the malloc() linked by default in our program is overridden; the Linux dynamic linker guarantees that the version specified by LD_PRELOAD takes priority.
Before running an executable linked with libtcmalloc, if we set the environment variable HEAPPROFILE to a file name, heap profile data will be written to that file as the program executes.
By default, whenever our program allocates 1 GB of memory, or whenever the program's memory usage high-water mark grows by 100 MB, a heap profile dump is performed. These parameters can be modified through environment variables.
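For example (a sketch; the two interval variables below are gperftools heap-profiler options, and the values shown simply make the documented defaults explicit):

$ env LD_PRELOAD="/usr/lib/libtcmalloc.so" \
      HEAPPROFILE=/tmp/demo.hprof \
      HEAP_PROFILE_ALLOCATION_INTERVAL=1073741824 \
      HEAP_PROFILE_INUSE_INTERVAL=104857600 \
      ./demo

This writes a series of /tmp/demo.hprof.NNNN.heap files as the program runs.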
The pprof script that ships with gperftools can analyze the dumped profile files, and its usage is basically the same as Go's:

$ pprof --gv gfs_master /tmp/profile.0100.heap
gperftools gv

$ pprof --text gfs_master /tmp/profile.0100.heap
   255.6  24.7%  24.7%    255.6  24.7% GFS_MasterChunk::AddServer
   184.6  17.8%  42.5%    298.8  28.8% GFS_MasterChunkTable::Create
   176.2  17.0%  59.5%    729.9  70.5% GFS_MasterChunkTable::UpdateState
   169.8  16.4%  75.9%    169.8  16.4% PendingClone::PendingClone
    76.3   7.4%  83.3%     76.3   7.4% __default_alloc_template::_S_chunk_alloc
    49.5   4.8%  88.0%     49.5   4.8% hashtable::resize
   ...
Likewise, the columns from left to right are Flat (MB), Flat%, Sum%, Cum (MB), Cum%, Name.
Implementation details

Similarly, tcmalloc adds sampling logic to malloc() and operator new. When the sampling hook is triggered according to its conditions, the following function executes:

// Record an allocation in the profile.
static void RecordAlloc(const void* ptr, size_t bytes, int skip_count) {
  // Take the stack trace outside the critical section.
  void* stack[HeapProfileTable::kMaxStackDepth];
  int depth = HeapProfileTable::GetCallerStackTrace(skip_count + 1, stack);
  SpinLockHolder l(&heap_lock);
  if (is_on) {
    heap_profile->RecordAlloc(ptr, bytes, depth, stack);
    MaybeDumpProfileLocked();
  }
}

void HeapProfileTable::RecordAlloc(
    const void* ptr, size_t bytes, int stack_depth,
    const void* const call_stack[]) {
  Bucket* b = GetBucket(stack_depth, call_stack);
  b->allocs++;
  b->alloc_size += bytes;
  total_.allocs++;
  total_.alloc_size += bytes;

  AllocValue v;
  v.set_bucket(b);  // also did set_live(false); set_ignore(false)
  v.bytes = bytes;
  address_map_->Insert(ptr, v);
}
The execution flow is:

1. Call GetCallerStackTrace() to get the call stack.
2. Call GetBucket() with the call stack as the hashmap key to get the corresponding Bucket.
3. Accumulate the statistics in the Bucket.
Since there is no GC to worry about, this sampling process is much simpler than Go's. Judging from the variable naming, the profiling code in the Go runtime was indeed ported from here.
sampler.h describes gperftools' sampling rules in detail; in short, they are the same as Go's: a 512 KB average sample step.
Some logic also needs to be added in free() or operator delete to record memory frees, which again is much simpler than in Go with its GC:

// Record a deallocation in the profile.
static void RecordFree(const void* ptr) {
  SpinLockHolder l(&heap_lock);
  if (is_on) {
    heap_profile->RecordFree(ptr);
    MaybeDumpProfileLocked();
  }
}

void HeapProfileTable::RecordFree(const void* ptr) {
  AllocValue v;
  if (address_map_->FindAndRemove(ptr, &v)) {
    Bucket* b = v.bucket();
    b->frees++;
    b->free_size += v.bytes;
    total_.frees++;
    total_.free_size += v.bytes;
  }
}

Just find the corresponding Bucket and accumulate the free-related fields.
Modern C/C++/Rust programs usually rely on the libunwind library to obtain the call stack. libunwind's backtracking principle is similar to Go's: it does not choose frame pointer backtracking either, but relies on an unwind table recorded in a specific section of the program. The difference is that Go relies on gopclntab, a section specific to its own ecosystem, while C/C++/Rust programs rely on the .debug_frame section or the .eh_frame section.
.debug_frame is defined by the DWARF standard. The Go compiler also writes this information, but does not use it itself; it is only for third-party tools. GCC writes debug information into .debug_frame only when the -g flag is on.
.eh_frame is more modern and is defined in the Linux Standard Base. The principle is to have the compiler insert pseudo-instructions (CFI Directives, Call Frame Information) at the corresponding positions in the assembly, to help the assembler generate the final .eh_frame section containing the unwind table.
Take the following code as an example:

// demo.c

int add(int a, int b) {
    return a + b;
}
We use cc -S demo.c to generate the assembly (both gcc and clang work), and note that the -g flag is not used here:

 .section __TEXT,__text,regular,pure_instructions
 .build_version macos, 11, 0 sdk_version 11, 3
 .globl _add                            ## -- Begin function add
 .p2align 4, 0x90
_add:                                   ## @add
 .cfi_startproc
## %bb.0:
 pushq %rbp
 .cfi_def_cfa_offset 16
 .cfi_offset %rbp, -16
 movq %rsp, %rbp
 .cfi_def_cfa_register %rbp
 movl %edi, -4(%rbp)
 movl %esi, -8(%rbp)
 movl -4(%rbp), %eax
 addl -8(%rbp), %eax
 popq %rbp
 retq
 .cfi_endproc
                                        ## -- End function
.subsections_via_symbols

In the generated assembly you can see many pseudo-instructions prefixed with .cfi_; these are the CFI Directives.
Heap Profiling with jemalloc

Next we focus on jemalloc, because TiKV uses jemalloc as its default memory allocator. Whether heap profiling runs smoothly on jemalloc is a point worth our attention.
Usage

jemalloc comes with heap profiling capability, but it is not enabled by default; you need to pass the --enable-prof flag at compile time:

./autogen.sh
./configure --prefix=/usr/local/jemalloc-5.1.0 --enable-prof
make
make install


Just like with tcmalloc, we can either link jemalloc into the program via -ljemalloc, or override libc's malloc() implementation with jemalloc via LD_PRELOAD.
We use a Rust program as the example to show how to do heap profiling through jemalloc:

fn main() {
    let mut data = vec![];
    loop {
        func1(&mut data);
        std::thread::sleep(std::time::Duration::from_secs(1));
    }
}

fn func1(data: &mut Vec<Box<[u8; 1024*1024]>>) {
    data.push(Box::new([0u8; 1024*1024])); // alloc 1mb
    func2(data);
}

fn func2(data: &mut Vec<Box<[u8; 1024*1024]>>) {
    data.push(Box::new([0u8; 1024*1024])); // alloc 1mb
}




Similar to the demo in the Go section, the Rust program also allocates 2 MB of heap memory per second, with func1 and func2 each allocating 1 MB and func1 calling func2.

Compile the file directly with rustc, without any special flags, and then execute the following commands to start the program:

$ export MALLOC_CONF="prof:true,lg_prof_interval:25"
$ export LD_PRELOAD=/usr/lib/libjemalloc.so
$ ./demo




MALLOC_CONF specifies jemalloc's options: prof:true enables the profiler, and lg_prof_interval:25 dumps a profile file for every 2^25 bytes (32 MB) of heap allocation.
Note: more MALLOC_CONF options can be found in the documentation.
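For instance, a sketch of a more explicit configuration (option names come from the jemalloc documentation; the values are illustrative): lg_prof_sample:19 sets the average sampling interval to 2^19 bytes, i.e. the familiar 512 KB, and prof_prefix controls the profile file name prefix:

$ export MALLOC_CONF="prof:true,lg_prof_sample:19,lg_prof_interval:25,prof_prefix:/tmp/jeprof"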
After waiting a while, you can see some profile files being generated. jemalloc provides a tool similar to tcmalloc's pprof, called jeprof; it was in fact forked from the pprof perl script. We can use jeprof to inspect the profile file:

$ jeprof ./demo jeprof.7262.0.i0.heap

It can also generate the same graph as Go/gperftools:

$ jeprof --gv ./demo jeprof.7262.0.i0.heap

jeprof svg
Implementation details

Similar to tcmalloc, jemalloc adds sampling logic in malloc():

JEMALLOC_ALWAYS_INLINE int
imalloc_body(static_opts_t *sopts, dynamic_opts_t *dopts, tsd_t *tsd) {
 // ...
 // If profiling is on, get our profiling context.
 if (config_prof && opt_prof) {
  bool prof_active = prof_active_get_unlocked();
  bool sample_event = te_prof_sample_event_lookahead(tsd, usize);
  prof_tctx_t *tctx = prof_alloc_prep(tsd, prof_active,
      sample_event);

  emap_alloc_ctx_t alloc_ctx;
  if (likely((uintptr_t)tctx == (uintptr_t)1U)) {
   alloc_ctx.slab = (usize <= SC_SMALL_MAXCLASS);
   allocation = imalloc_no_sample(
       sopts, dopts, tsd, usize, usize, ind);
  } else if ((uintptr_t)tctx > (uintptr_t)1U) {
   allocation = imalloc_sample(
       sopts, dopts, tsd, usize, ind);
   alloc_ctx.slab = false;
  } else {
   allocation = NULL;
  }

  if (unlikely(allocation == NULL)) {
   prof_alloc_rollback(tsd, tctx);
   goto label_oom;
  }
  prof_malloc(tsd, allocation, size, usize, &alloc_ctx, tctx);
 } else {
  assert(!opt_prof);
  allocation = imalloc_no_sample(sopts, dopts, tsd, size, usize,
      ind);
  if (unlikely(allocation == NULL)) {
   goto label_oom;
  }
 }
 // ...
}




In prof_malloc(), prof_malloc_sample_object() is called to accumulate the corresponding call stack record in the hashmap:

void
prof_malloc_sample_object(tsd_t *tsd, const void *ptr, size_t size,
    size_t usize, prof_tctx_t *tctx) {
 // ...
 malloc_mutex_lock(tsd_tsdn(tsd), tctx->tdata->lock);
 size_t shifted_unbiased_cnt = prof_shifted_unbiased_cnt[szind];
 size_t unbiased_bytes = prof_unbiased_sz[szind];
 tctx->cnts.curobjs++;
 tctx->cnts.curobjs_shifted_unbiased += shifted_unbiased_cnt;
 tctx->cnts.curbytes += usize;
 tctx->cnts.curbytes_unbiased += unbiased_bytes;
 // ...
}
The logic jemalloc injects into free() is also similar to tcmalloc's, and jemalloc likewise depends on libunwind for stack backtracking, so we will not repeat that here.
Heap Profiling with bytehound

Bytehound is a memory profiler for the Linux platform, written in Rust. Its distinguishing feature is the relatively rich front-end functionality it provides. Our focus here is how it is implemented and whether it can be used in TiKV, so we only briefly cover basic usage.
Usage

We can download bytehound's binary and dynamic library from its Releases page; only the Linux platform is supported.
Then, like with tcmalloc or jemalloc, mount its implementation via LD_PRELOAD. Here we assume we are running the same leaky Rust program as in the Heap Profiling with jemalloc section:

$ LD_PRELOAD=./libbytehound.so ./demo
Next, a memory-profiling_*.dat file is generated in the program's working directory; this is the product of bytehound's heap profiling. Note that, unlike other heap profilers, this file is continuously updated rather than a new file being generated at fixed intervals.
Then execute the following command to open a web port for real-time analysis of the file above:

$ ./bytehound server memory-profiling_*.dat
Bytehound GUI

The most intuitive way is to click Flamegraph in the top right corner to view the flame graph:

Bytehound Flamegraph

From the graph it is easy to see that demo::func1 and demo::func2 are the memory hotspots.
Bytehound provides rich GUI functionality, which is one of its highlights; you can explore it with the help of its documentation.
Implementation details

Bytehound also replaces the user's default malloc implementation. However, bytehound does not implement a memory allocator itself; it is a wrapper around jemalloc.

// entry point
#[cfg_attr(not(test), no_mangle)]
pub unsafe extern "C" fn malloc( size: size_t ) -> *mut c_void {
    allocate( size, AllocationKind::Malloc )
}

#[inline(always)]
unsafe fn allocate( requested_size: usize, kind: AllocationKind ) -> *mut c_void {
    // ...
    // call jemalloc to allocate memory
    let pointer = match kind {
        AllocationKind::Malloc => {
            if opt::get().zero_memory {
                calloc_real( effective_size as size_t, 1 )
            } else {
                malloc_real( effective_size as size_t )
            }
        },
        // ...
    };
    // ...
    //  Stack backtracking 
    let backtrace = unwind::grab( &mut thread );
    // ...
    //  Record the sample 
    on_allocation( id, allocation, backtrace, thread );
    pointer
}

// xxx_real links to the jemalloc implementation
#[cfg(feature = "jemalloc")]
extern "C" {
    #[link_name = "_rjem_mp_malloc"]
    fn malloc_real( size: size_t ) -> *mut c_void;
    // ...
}

As you can see, a stack backtrace is taken and recorded on every malloc; there is no sampling logic. In the on_allocation hook, the allocation record is sent to a channel and processed asynchronously by a unified processor thread.

pub fn on_allocation(
    id: InternalAllocationId,
    allocation: InternalAllocation,
    backtrace: Backtrace,
    thread: StrongThreadHandle
) {
    // ...
    crate::event::send_event_throttled( move || {
        InternalEvent::Alloc {
            id,
            timestamp,
            allocation,
            backtrace,
        }
    });
}

#[inline(always)]
pub(crate) fn send_event_throttled< F: FnOnce() -> InternalEvent >( callback: F ) {
    EVENT_CHANNEL.chunked_send_with( 64, callback );
}


And the implementation of EVENT_CHANNEL is simply Mutex<Vec<T>>:

pub struct Channel< T > {
    queue: Mutex< Vec< T > >,
    condvar: Condvar
}

Performance overhead

In this section we explore the performance overhead of each of the heap profilers above; the specific measurement method varies with the scenario.
All tests were run separately on the same physical machine environment.
Go

For Go, our measurement method was to deploy a single node with TiDB + unistore, adjust the runtime.MemProfileRate parameter, and measure with sysbench.
Compared with no recording at all, the performance loss of 512 KB sampling stays basically within 1%, for both TPS/QPS and P95 latency. The overhead of full recording matches the expectation that it "would be very high", but it was still surprisingly high: TPS/QPS shrank to one twentieth, and P95 latency increased 30 times.
Since heap profiling is a generic capability, we cannot accurately state a universal performance loss across all scenarios; only measurements made on a specific project are meaningful. TiDB is a relatively compute-intensive application, and its memory allocation may not be as frequent as in some memory-intensive applications, so this conclusion (and all following conclusions) is for reference only; readers can measure the overhead in their own application scenarios.
tcmalloc/jemalloc

We measured tcmalloc/jemalloc based on TiKV, by deploying one PD process and one TiKV process on the machine and running go-ycsb for the stress test. Key parameters were as follows:

threadcount=200
recordcount=100000
operationcount=1000000
fieldcount=20
Before starting TiKV, we used LD_PRELOAD to inject the different malloc hooks. tcmalloc used the default configuration, i.e. 512 KB sampling similar to Go's; jemalloc used the default sampling policy and dumped a profile file for every 1 GB of heap allocation.
In the end, tcmalloc and jemalloc performed almost identically: OPS dropped by about 4% compared with the default memory allocator, and the P99 latency line rose by about 10%.
We learned earlier that tcmalloc's implementation is basically the same as that of Go heap pprof, yet the data measured here is not quite consistent with the Go numbers. The reason is that TiKV and TiDB have different memory allocation characteristics, which confirms the earlier statement: "we cannot accurately state a universal performance loss across all scenarios; only measurements made on a specific project are meaningful."
bytehound

The reason we did not put bytehound together with tcmalloc/jemalloc is that, in practice on TiKV, bytehound hits a deadlock during the startup phase.
Since we speculated that bytehound's performance overhead would be very high and that it theoretically could not be used in the TiKV production environment, we only needed to confirm that conclusion.
Note: the reason for suspecting high performance overhead is that no sampling logic was found in the bytehound code: the data collected on every allocation is sent through a channel to a background thread for processing, and the channel is simply wrapped with a Mutex + Vec.
We chose the simple mini-redis project to measure bytehound's performance overhead. Since the goal is only to confirm whether it meets the requirements for use in TiKV's production environment, not to measure precise numbers, we simply count operations and compare TPS. The driver code snippet is as follows:

var count int32

for n := 0; n < 128; n++ {
 go func() {
  for {
   key := uuid.New()
   err := client.Set(key, key, 0).Err()
   if err != nil {
    panic(err)
   }
   err = client.Get(key).Err()
   if err != nil {
    panic(err)
   }
   atomic.AddInt32(&count, 1)
  }
 }()
}

We spin up 128 goroutines to read and write against the server; one read plus one write counts as a complete operation. Only the count is measured, with no latency or other metrics; the final total divided by the execution time gives the TPS before and after enabling bytehound. The result: TPS dropped by more than 50%.
What can BPF bring

Although BPF's performance overhead is low, BPF-based approaches can largely only obtain system-level indicators. Heap profiling in the ordinary sense needs to hook into the memory allocation path, but memory allocation tends to be layered.
For example, if our program mallocs a large block of heap memory up front as a memory pool and runs its own allocation algorithm on top, with all heap memory needed by the business logic allocated from the pool, then the existing heap profilers don't work: they would only tell you that the program requested a large amount of memory at startup and that the amount requested at all other times is 0. In this scenario we need to intrude into our own memory allocation code and do, at its entry point, what a heap profiler would do.
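A contrived Go sketch of the problem (the pool type is hypothetical, purely for illustration; any allocator-level profiler behaves the same way):

package main

// pool hands out fixed-size chunks carved from one big slab allocated
// up front. An allocator-level heap profiler sees only the single
// make() in newPool, attributed to the startup phase.
type pool struct {
	slab []byte
	off  int
}

func newPool(size int) *pool {
	return &pool{slab: make([]byte, size)} // the only allocation a profiler sees
}

// alloc never touches the real allocator, so it is invisible to the
// heap profiler no matter how hot it is.
func (p *pool) alloc(n int) []byte {
	buf := p.slab[p.off : p.off+n]
	p.off += n
	return buf
}

func main() {
	p := newPool(64 << 20) // 64 MB up front
	for i := 0; i < 1000; i++ {
		_ = p.alloc(1024) // the real hotspot, never sampled
	}
}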
The problem with BPF is similar: we could hang a hook on brk/sbrk and record the current stack trace whenever user space genuinely needs to ask the kernel to grow the heap. However, the memory allocator is a complex black box, and the user stack that triggers brk/sbrk most often is not necessarily the user stack causing the memory leak. This needs to be verified by some experiments; if the results do turn out to be valuable, it would not be impossible to keep BPF running long-term as a low-overhead fallback solution (BPF's permission issues would need additional consideration).
As for uprobes, they are merely non-intrusive code instrumentation; for heap profiling, the same logic still has to run inside the allocator, bringing the same overhead, while we are not sensitive to code intrusion in the first place.
https://github.com/parca-dev/parca implements BPF-based continuous profiling, but in fact only its CPU profiler really takes advantage of BPF. bcc-tools already contains a Python tool for CPU profiling (https://github.com/iovisor/bcc/blob/master/tools/profile.py) built on the same core principle. For heap profiling, there is not much to borrow from it for now.
