Locating and Troubleshooting Memory Leaks: An Analysis of Heap Profiling Principles
2022-06-24 01:25:00 [PingCAP]
After a system has been running for a long time, its available memory gets lower and lower, sometimes to the point that services fail. This is a typical memory leak. Such problems are usually hard to predict, and they are also hard to locate through static code review. Heap Profiling exists to help us solve this kind of problem.
TiKV, as part of a distributed system, already has preliminary Heap Profiling capability. This article introduces the implementation principles and usage of several common Heap Profilers, to help readers understand TiKV's implementation more easily and to better apply this kind of analysis to their own projects.
What is Heap Profiling
Runtime memory leaks are quite difficult to troubleshoot in many scenarios, because such problems are usually unpredictable and hard to locate through static code review.

Heap Profiling exists to help us solve this kind of problem.

Heap Profiling usually means collecting or sampling an application's heap allocations to report on the program's memory usage, in order to analyze the cause of memory occupation or locate the root of a memory leak.
How Heap Profiling works

As a contrast, let's first take a quick look at how CPU Profiling works.

When we are about to do CPU Profiling, we usually select a time window. Within this window, the CPU Profiler registers a hook that the target program executes periodically (there are many means, e.g. the SIGPROF signal), and inside this hook we grab the business thread's current stack trace each time.

We keep the hook's execution frequency at a specific value, e.g. 100hz, so that a call-stack sample of the business code is collected every 10ms. When the time window ends, we aggregate all the collected samples to get each function's collection count; comparing against the total number of samples yields each function's relative proportion.

With this model we can find the functions with high proportions and then locate CPU hotspots.
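For a concrete feel of the "time window", here is a minimal Go sketch using the standard runtime/pprof API; the window is simply the span between StartCPUProfile and StopCPUProfile:

package main

import (
	"os"
	"runtime/pprof"
)

func busyWork() {
	sum := 0
	for i := 0; i < 500_000_000; i++ {
		sum += i
	}
	_ = sum
}

func main() {
	f, err := os.Create("cpu.prof")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Open the profiling window: from here on, the runtime delivers
	// periodic samples (100Hz by default) and aggregates stack traces.
	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	busyWork()             // the workload whose hotspots we want
	pprof.StopCPUProfile() // close the window; samples are flushed to f
}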
In terms of data structure, Heap Profiling is very similar to CPU Profiling: both follow the stack trace + statistics model. If you have used the pprof provided by Go, you will find that the two display formats are almost identical:
Go CPU Profile
Go Heap Profile
Unlike CPU Profiling, Heap Profiling's data collection is not done with a simple timer; it has to intrude into the memory allocation path — that is how it obtains the size of each allocation. So the usual approach of a Heap Profiler is to integrate itself directly into the memory allocator, capture the current stack trace when the application allocates memory, and finally aggregate all the samples together, so that we know each function's direct and indirect allocation sizes.

The stack trace + statistics data model of a Heap Profile is consistent with that of a CPU Profile.
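To make the model concrete, here is a deliberately simplified Go sketch (not how any real profiler is implemented) of an instrumented allocation entry: it records a stack trace per allocation and aggregates identical stacks in a hashmap, which is exactly what the profilers discussed below do inside malloc:

package main

import (
	"fmt"
	"runtime"
)

// stackKey aggregates identical samples: two samples are "the same"
// when their PC arrays are exactly equal.
type stackKey [8]uintptr

type stats struct {
	allocs int
	bytes  int
}

var profile = map[stackKey]*stats{}

// alloc is our toy allocation entry point: it records the caller's
// stack trace and accumulates statistics before actually allocating.
func alloc(size int) []byte {
	var k stackKey
	runtime.Callers(2, k[:]) // skip runtime.Callers and alloc itself
	s, ok := profile[k]
	if !ok {
		s = &stats{}
		profile[k] = s
	}
	s.allocs++
	s.bytes += size
	return make([]byte, size)
}

func main() {
	for i := 0; i < 10; i++ {
		_ = alloc(1024)
	}
	for k, s := range profile {
		frame, _ := runtime.CallersFrames(k[:]).Next()
		fmt.Printf("%s: %d allocs, %d bytes\n", frame.Function, s.allocs, s.bytes)
	}
}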
Next we introduce the usage and implementation principles of several Heap Profilers.

Note: the usage scenarios of tools such as GNU gprof and Valgrind do not match our target scenario, so this article will not cover them. For the reasons, see gprof, Valgrind and gperftools - an evaluation of some tools for application level CPU profiling on Linux - Gernot Klingler.
Heap Profiling in Go
Most readers are probably more familiar with Go, so we take Go as the starting point and baseline of our study.

Note: a concept explained in an earlier section will not be repeated in later sections, even when the projects differ. Also, for completeness, every project comes with a Usage section describing how to use it; those already familiar with a tool can skip that section.
Usage
The Go runtime has a convenient profiler built in, and heap is one of its profile types. We can open a debug port like this:
import _ "net/http/pprof"
go func() {
log.Print(http.ListenAndServe("0.0.0.0:9999", nil))
}()
Then, while the program is running, fetch a snapshot of the current Heap Profiling with the command line:

$ go tool pprof http://127.0.0.1:9999/debug/pprof/heap
Or we can obtain a Heap Profiling snapshot directly at a specific place in the application code:

import "runtime/pprof"

pprof.WriteHeapProfile(writer)
Here we string together the usage of heap pprof with a complete demo:

package main
import (
"log"
"net/http"
_ "net/http/pprof"
"time"
)
func main() {
go func() {
log.Fatal(http.ListenAndServe(":9999", nil))
}()
var data [][]byte
for {
data = func1(data)
time.Sleep(1 * time.Second)
}
}
func func1(data [][]byte) [][]byte {
data = func2(data)
return append(data, make([]byte, 1024*1024)) // alloc 1mb
}
func func2(data [][]byte) [][]byte {
return append(data, make([]byte, 1024*1024)) // alloc 1mb
}
This code continually allocates memory in func1 and func2, allocating 2mb of heap memory per second in total.

After running the program for a while, execute the following command to grab a profile snapshot and start a web service for browsing it:

$ go tool pprof -http=":9998" localhost:9999/debug/pprof/heap
Go Heap Graph

From the graph we can intuitively see which functions' memory allocations account for the majority (bigger boxes), as well as the call relationships between functions (the connecting lines). For example, it is obvious in the figure above that the allocations of func1 and func2 account for the majority, and that func2 is called by func1.

Note that since Heap Profiling is also sampled (by default, one sample per 512KB allocated), the memory sizes shown here are smaller than the sizes actually allocated. As with CPU Profiling, these values are only used to calculate relative proportions and then locate memory allocation hotspots.

Note: in fact, the Go runtime contains logic that estimates the original size from the sampled results, but the estimate is not necessarily accurate.
Besides, the 48.88% of 90.24% in func1's box means Flat% of Cum%.

What are Flat% and Cum%? Let's switch the browsing mode: in the View drop-down in the upper-left corner, click Top:

Go Heap Top

- Name: the function name
- Flat: the memory allocated by the function itself
- Flat%: Flat as a proportion of the total allocation size
- Cum: the memory allocated by the function plus all the sub-functions it calls
- Cum%: Cum as a proportion of the total allocation size
- Sum%: the top-down accumulation of Flat% (letting you judge at a glance how much memory is allocated from a given line upward)

For func1, Flat covers only the make([]byte, ...) inside func1 itself, while Cum additionally includes the allocation made by func2, which it calls.

The above two views help us locate specific functions. Go also provides finer-grained, line-level allocation statistics: in the View drop-down, click Source:

Go Heap Source

In CPU Profiling we often use flame graphs to look for wide tops, locating hotspot functions quickly and intuitively. Thanks to the homogeneity of the data model, Heap Profiling data can of course also be displayed as a flame graph: in the View drop-down, click Flame Graph:

Go Heap Flamegraph

With the methods above we can easily see that the main memory allocations come from func1 and func2. In real-world scenarios, however, locating the root cause is never this simple: since what we get is a snapshot at a single moment, it is not sufficient for a memory leak; what we need is incremental data, to determine which memory keeps growing. So we can take the Heap Profile again after an interval and diff the two results.
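pprof supports this diff workflow directly through its -base flag. A sketch, assuming we save two snapshots a few minutes apart:

$ curl -o heap-old.prof http://127.0.0.1:9999/debug/pprof/heap
$ sleep 300
$ curl -o heap-new.prof http://127.0.0.1:9999/debug/pprof/heap
$ go tool pprof -http=":9998" -base heap-old.prof heap-new.prof

Only the allocations that happened between the two snapshots remain in the view, which is exactly the increment a leak hunt needs.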
Implementation details

In this section we focus on the implementation principles of Go Heap Profiling.

Recall the section "How Heap Profiling works": the usual approach of a Heap Profiler is to integrate itself directly into the memory allocator and obtain the current stack trace when the application allocates memory — and that is exactly what Go does.

Go's memory allocation entry is the mallocgc() function in src/runtime/malloc.go. Some of its key code is as follows:

func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
// ...
if rate := MemProfileRate; rate > 0 {
// Note cache c only valid while m acquired; see #47302
if rate != 1 && size < c.nextSample {
c.nextSample -= size
} else {
profilealloc(mp, x, size)
}
}
// ...
}
func profilealloc(mp *m, x unsafe.Pointer, size uintptr) {
c := getMCache()
if c == nil {
throw("profilealloc called without a P or outside bootstrapping")
}
c.nextSample = nextSample()
mProf_Malloc(x, size)
}
This means: on average, for every 512KB of heap memory allocated through mallocgc(), profilealloc() is called once to record a stack trace.

Why define a sampling granularity at all? Wouldn't recording the current stack trace on every mallocgc() be more accurate?

Completely and accurately capturing every function's memory allocations looks attractive, but the performance overhead is huge. malloc(), as a user-space library function, is called very frequently by applications, and optimizing allocation performance is the allocator's responsibility. If every malloc() call carried a stack backtrace, the overhead would be almost unacceptable — especially for long-running, continuous profiling on the server side. "Sampling" is chosen not because the results are better; it is simply a compromise.

Of course, we can also modify the MemProfileRate variable ourselves. Setting it to 1 makes every mallocgc() record a stack trace; setting it to 0 turns Heap Profiling off completely. Users can trade performance against accuracy according to their actual scenario.
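MemProfileRate is exported from the runtime package; a minimal sketch of adjusting it — note it should be set as early as possible, before the allocations of interest happen:

package main

import "runtime"

func init() {
	// Default is 512 * 1024: one sample per 512KB allocated on average.
	// 1 records every single allocation (very expensive);
	// 0 disables heap profiling entirely.
	runtime.MemProfileRate = 1
}

func main() {
	// ... application code ...
}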
Note that when MemProfileRate is set to a normal sampling granularity, the interval is not exact: each sampling point is a random value drawn from an exponential distribution whose mean is MemProfileRate.

// nextSample returns the next sampling point for heap profiling. The goal is
// to sample allocations on average every MemProfileRate bytes, but with a
// completely random distribution over the allocation timeline; this
// corresponds to a Poisson process with parameter MemProfileRate. In Poisson
// processes, the distance between two samples follows the exponential
// distribution (exp(MemProfileRate)), so the best return value is a random
// number taken from an exponential distribution whose mean is MemProfileRate.
func nextSample() uintptr
In many cases memory allocation follows regular patterns; if sampling happened at a fixed granularity, the final results could have large errors — every sampling point might happen to land on a particular type of allocation. Randomization is chosen here precisely to avoid that.
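A sketch of the idea (the Go runtime uses its own fast implementation; here we approximate it with math/rand, whose ExpFloat64 draws from an exponential distribution with mean 1):

package main

import (
	"fmt"
	"math/rand"
)

const memProfileRate = 512 * 1024

// nextSample returns how many bytes to allocate before taking the
// next sample: exponentially distributed with memProfileRate as the
// mean, so sampling cannot lock onto a regular allocation pattern.
func nextSample() uintptr {
	return uintptr(rand.ExpFloat64() * memProfileRate)
}

func main() {
	for i := 0; i < 5; i++ {
		fmt.Println(nextSample())
	}
}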
It is not just Heap Profiling: sampling-based profilers of all kinds always carry some error (example: SafePoint Bias). When reading sampling-based profiling results, remind yourself never to ignore the possibility of error.
The mProf_Malloc() function in src/runtime/mprof.go is responsible for the actual sampling work:

// Called by malloc to record a profiled block.
func mProf_Malloc(p unsafe.Pointer, size uintptr) {
var stk [maxStack]uintptr
nstk := callers(4, stk[:])
lock(&proflock)
b := stkbucket(memProfile, size, stk[:nstk], true)
c := mProf.cycle
mp := b.mp()
mpc := &mp.future[(c+2)%uint32(len(mp.future))]
mpc.allocs++
mpc.alloc_bytes += size
unlock(&proflock)
// Setprofilebucket locks a bunch of other mutexes, so we call it outside of proflock.
// This reduces potential contention and chances of deadlocks.
// Since the object must be alive during call to mProf_Malloc,
// it's fine to do this non-atomically.
systemstack(func() {
setprofilebucket(p, b)
})
}
func callers(skip int, pcbuf []uintptr) int {
sp := getcallersp()
pc := getcallerpc()
gp := getg()
var n int
systemstack(func() {
n = gentraceback(pc, sp, 0, gp, skip, &pcbuf[0], len(pcbuf), nil, nil, 0)
})
return n
}
By calling callers() and, further down, gentraceback(), the current call stack is obtained into the stk array (i.e. an array of PC addresses). This technique is called call stack backtracking, and it is used in many other scenarios (such as stack unwinding when a program panics).

Note: PC refers to the Program Counter, which on x86-64 is the RIP register; FP refers to the Frame Pointer, on x86-64 the RBP register; SP refers to the Stack Pointer, on x86-64 the RSP register.

One original implementation of call stack backtracking works by guaranteeing, in the calling convention, that when a function call happens the RBP register (on x86-64) always holds the stack base address rather than being used as a general-purpose register. Since the call instruction first pushes RIP (the return address), we only need to ensure that the first thing pushed onto the stack is the current RBP; then the stack base addresses of all functions are strung into a linked list of addresses headed by RBP. To obtain the array of RIPs, we just shift each RBP address down by one slot.

Go FramePointer Backtrace (picture from go-profiler-notes)

Note: the picture says all Go arguments are passed on the stack. That conclusion is now outdated: Go supports passing arguments in registers since version 1.17.
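The following runnable Go sketch simulates that linked list with fake frames rather than reading the real stack (which would require assembly); it only illustrates the walk: each frame's saved RBP points at the caller's frame, and the slot next to it holds the return address:

package main

import "fmt"

// frame mimics the two words at the base of a stack frame when frame
// pointers are preserved: the caller's saved RBP and, next to it,
// the return address pushed by the call instruction.
type frame struct {
	prevFP *frame  // saved RBP: the head of the linked list
	retPC  uintptr // saved RIP: the return address
}

// backtrace walks the frame-pointer linked list, collecting return PCs.
func backtrace(fp *frame) []uintptr {
	var pcs []uintptr
	for fp != nil {
		pcs = append(pcs, fp.retPC)
		fp = fp.prevFP
	}
	return pcs
}

func main() {
	// Simulate main -> func1 -> func2; the addresses are made up.
	mainFrame := &frame{prevFP: nil, retPC: 0x401000}
	f1 := &frame{prevFP: mainFrame, retPC: 0x401234}
	f2 := &frame{prevFP: f1, retPC: 0x401456}
	fmt.Printf("%#x\n", backtrace(f2)) // [0x401456 0x401234 0x401000]
}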
Because x86-64 classifies RBP as a general-purpose register, compilers such as GCC no longer use RBP to store the stack base address by default, unless a specific option turns that back on. The Go compiler, however, keeps this feature, so stack backtracking via RBP is feasible in Go.

But Go did not adopt this simple scheme, because it causes problems in some special scenarios. For example, if a function gets inlined, the call stack obtained through RBP backtracking is missing frames. This scheme also needs extra instructions inserted around regular function calls and occupies an extra general-purpose register, which carries a certain performance cost even when no stack backtracking is needed.

Every Go binary contains a section named gopclntab, short for Go Program Counter Line Table, which maintains the mapping from PC to SP and its return address. So we don't need to rely on the FP at all, and can chain up the PC linked list directly by table lookup. gopclntab also records, for each PC, whether its enclosing function has been inline-optimized, so inlined function frames are not lost during stack backtracking. In addition, gopclntab maintains a symbol table storing the code information (function name, line number, etc.) corresponding to each PC, so in the end we see a human-readable panic result or profiling result rather than a pile of raw addresses.

gopclntab

Unlike gopclntab, which is specific to Go, DWARF is a standardized debugging format. The Go compiler also adds DWARF (v4) information to the binaries it builds, so external tools outside the Go ecosystem can rely on it to debug Go programs. It is worth mentioning that the information contained in DWARF is a superset of gopclntab.

Back to Heap Profiling: once we have obtained the PC array with the stack backtracking technique (the gentraceback() function in the code above), we do not have to rush to symbolize it. Symbolization has a considerable cost; we can aggregate by the raw pointer stack first. "Aggregating" means accumulating identical samples in a hashmap, where two samples are identical when the contents of their arrays match exactly.
The stkbucket() function takes stk as the key to get the corresponding bucket, and then the statistics fields in it are accumulated.

Also, notice that memRecord holds several groups of memRecordCycle statistics:

type memRecord struct {
active memRecordCycle
future [3]memRecordCycle
}
When accumulating, the mProf.cycle global variable, taken modulo the array length, is used as the subscript to pick a particular group of memRecordCycle. mProf.cycle is incremented after every round of GC, which records the allocations between three rounds of GC separately. Only after a round of GC finishes are the memory allocations and releases that happened between the previous GC and this one merged into the finally displayed statistics. This design avoids the situation where we grab a Heap Profile before a GC has run and see lots of useless temporary memory.

Moreover, we might also see unstable heap memory states at different moments within a single GC cycle.

Finally, setprofilebucket() is called to record the bucket on the mspan associated with the allocated address; during a later GC, mProf_Free() is called to record the corresponding release.

In this way the Go runtime maintains this collection of buckets at all times. When we want to do Heap Profiling (e.g. when calling pprof.WriteHeapProfile()), this bucket collection is visited and converted into the output format pprof requires.

This is also a difference between Heap Profiling and CPU Profiling: CPU Profiling only imposes sampling overhead on the application during the profiling time window, whereas Heap Profiling samples all the time — running a profiling just dumps a snapshot of the data accumulated so far.

Next we enter the world of C/C++/Rust. Fortunately, since most Heap Profilers share similar implementation principles, much of what we covered above carries over. Most typically, Go's Heap Profiling was in fact ported from Google tcmalloc, and they have similar implementations.
Heap Profiling with gperftools

gperftools (Google Performance Tools) is a toolkit that includes a Heap Profiler, a Heap Checker, a CPU Profiler, and other tools. The reason we introduce it right after Go is that it has deep roots in common with Go.

The Google tcmalloc mentioned earlier, which the Go runtime ported its code from, has split into two community versions: one is tcmalloc, a pure malloc implementation with no extra features; the other is gperftools, a malloc implementation with Heap Profiling capability, plus other supporting tools.

Among those tools, pprof is the most widely known. pprof was at first a perl script, and later evolved into a powerful tool written in Go; it has now been integrated into the Go trunk, and the go tool pprof command we usually run uses the pprof package directly under the hood.

Note: the main author of gperftools is Sanjay Ghemawat, the legendary engineer who pair-programmed with Jeff Dean.

Usage

Google has long used gperftools' Heap Profiler internally to analyze the heap memory allocation of C++ programs. It can do:

- Figuring out what is in the program heap at any given time
- Locating memory leaks
- Finding places that do a lot of allocation

As the ancestor of Go pprof, this looks exactly like the Heap Profiling capability Go provides.

Where Go hard-codes the collection logic into the runtime's memory allocation function, gperftools analogously embeds the collection code into the malloc implementation of the libtcmalloc it ships. The user links the library with -ltcmalloc in the project's compile-and-link phase to replace libc's default malloc implementation.
Of course, we can also rely on Linux's dynamic linking mechanism to replace it at runtime:

$ env LD_PRELOAD="/usr/lib/libtcmalloc.so" <binary>
When LD_PRELOAD specifies libtcmalloc.so, the malloc() our program links by default is overridden; the Linux dynamic linker guarantees that the version specified by LD_PRELOAD executes first.

Before running the executable linked with libtcmalloc, if we set the environment variable HEAPPROFILE to a file name, the Heap Profile data will be written to that file as the program runs.

By default, a Heap Profile dump is performed whenever our program allocates 1GB of memory, or whenever the program's memory-usage high-water mark grows by 100MB. These parameters can be modified through environment variables.
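Putting it together, a usage sketch; HEAP_PROFILE_ALLOCATION_INTERVAL and HEAP_PROFILE_INUSE_INTERVAL are the gperftools knobs (in bytes) for the two thresholds just mentioned, shown here at their default values:

$ env LD_PRELOAD="/usr/lib/libtcmalloc.so" \
      HEAPPROFILE=/tmp/demo.hprof \
      HEAP_PROFILE_ALLOCATION_INTERVAL=1073741824 \
      HEAP_PROFILE_INUSE_INTERVAL=104857600 \
      ./demo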
The pprof script bundled with gperftools can analyze the dumped profile files; its usage is basically the same as in Go:

$ pprof --gv gfs_master /tmp/profile.0100.heap

gperftools gv

$ pprof --text gfs_master /tmp/profile.0100.heap
255.6 24.7% 24.7% 255.6 24.7% GFS_MasterChunk::AddServer
184.6 17.8% 42.5% 298.8 28.8% GFS_MasterChunkTable::Create
176.2 17.0% 59.5% 729.9 70.5% GFS_MasterChunkTable::UpdateState
169.8 16.4% 75.9% 169.8 16.4% PendingClone::PendingClone
76.3 7.4% 83.3% 76.3 7.4% __default_alloc_template::_S_chunk_alloc
49.5 4.8% 88.0% 49.5 4.8% hashtable::resize
...
Likewise, the columns from left to right are Flat(mb), Flat%, Sum%, Cum(mb), Cum%, Name.

Implementation details

Similarly, tcmalloc adds some sampling logic in malloc() and operator new. When the sampling hook is triggered according to its conditions, the following function executes:

// Record an allocation in the profile.
static void RecordAlloc(const void* ptr, size_t bytes, int skip_count) {
// Take the stack trace outside the critical section.
void* stack[HeapProfileTable::kMaxStackDepth];
int depth = HeapProfileTable::GetCallerStackTrace(skip_count + 1, stack);
SpinLockHolder l(&heap_lock);
if (is_on) {
heap_profile->RecordAlloc(ptr, bytes, depth, stack);
MaybeDumpProfileLocked();
}
}
void HeapProfileTable::RecordAlloc(
const void* ptr, size_t bytes, int stack_depth,
const void* const call_stack[]) {
Bucket* b = GetBucket(stack_depth, call_stack);
b->allocs++;
b->alloc_size += bytes;
total_.allocs++;
total_.alloc_size += bytes;
AllocValue v;
v.set_bucket(b); // also did set_live(false); set_ignore(false)
v.bytes = bytes;
address_map_->Insert(ptr, v);
}
The execution flow is:

- Call GetCallerStackTrace() to obtain the call stack.
- With the call stack as the hashmap key, call GetBucket() to get the corresponding Bucket.
- Accumulate the statistics in the Bucket.

Since there is no GC, the sampling process is much simpler than Go's. Judging from the variable naming, the profiling code in the Go runtime was indeed ported from here.

sampler.h describes gperftools' sampling rules in detail; on the whole they agree with Go's, namely a 512KB average sample step.

Recording memory release also requires some extra logic in free() or operator delete, which again is much simpler than in Go with its GC:

// Record a deallocation in the profile.
static void RecordFree(const void* ptr) {
SpinLockHolder l(&heap_lock);
if (is_on) {
heap_profile->RecordFree(ptr);
MaybeDumpProfileLocked();
}
}
void HeapProfileTable::RecordFree(const void* ptr) {
AllocValue v;
if (address_map_->FindAndRemove(ptr, &v)) {
Bucket* b = v.bucket();
b->frees++;
b->free_size += v.bytes;
total_.frees++;
total_.free_size += v.bytes;
}
}
Just look up the corresponding Bucket and accumulate the free-related fields.

Modern C/C++/Rust programs usually rely on the libunwind library to obtain the call stack. Similar to Go's approach to stack backtracking, libunwind does not choose the Frame Pointer backtracking mode either; both depend on an unwind table recorded in a particular section of the program. The difference is that Go relies on gopclntab, a section created within its own ecosystem, while C/C++/Rust programs rely on the .debug_frame section or the .eh_frame section.

.debug_frame is defined by the DWARF standard. The Go compiler writes this information too, but does not use it itself; it is kept only for third-party tools. GCC writes debug information into .debug_frame only when the -g option is enabled.

.eh_frame is more modern and is defined in the Linux Standard Base. The principle is to have the compiler insert pseudo-instructions (CFI Directives, Call Frame Information) at the corresponding positions in the assembly code, to assist the assembler in generating the .eh_frame section that ultimately contains the unwind table.
Take the following code as an example:

// demo.c
int add(int a, int b) {
return a + b;
}
We use cc -S demo.c to generate the assembly code (either gcc or clang will do); note that the -g option is not used here.

    .section __TEXT,__text,regular,pure_instructions
.build_version macos, 11, 0 sdk_version 11, 3
.globl _add ## -- Begin function add
.p2align 4, 0x90
_add: ## @add
.cfi_startproc
## %bb.0:
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl -4(%rbp), %eax
addl -8(%rbp), %eax
popq %rbp
retq
.cfi_endproc
## -- End function
.subsections_via_symbols
From the generated assembly code you can see many pseudo-instructions prefixed with .cfi_; these are the CFI Directives.

Heap Profiling with jemalloc

Next we turn to jemalloc, because TiKV uses jemalloc as its memory allocator by default. Whether Heap Profiling runs smoothly on jemalloc is a point that deserves our attention.

Usage

jemalloc comes with Heap Profiling capability, but it is not enabled by default; you need to pass the --enable-prof option at compile time:

./autogen.sh
./configure --prefix=/usr/local/jemalloc-5.1.0 --enable-prof
make
make install

Exactly as with tcmalloc, we can either link jemalloc into the program via -ljemalloc, or override libc's malloc() implementation with jemalloc via LD_PRELOAD.

We take a Rust program as the example to show how to do Heap Profiling through jemalloc:

fn main() {
let mut data = vec![];
loop {
func1(&mut data);
std::thread::sleep(std::time::Duration::from_secs(1));
}
}
fn func1(data: &mut Vec<Box<[u8; 1024*1024]>>) {
data.push(Box::new([0u8; 1024*1024])); // alloc 1mb
func2(data);
}
fn func2(data: &mut Vec<Box<[u8; 1024*1024]>>) {
data.push(Box::new([0u8; 1024*1024])); // alloc 1mb
}
Similar to the demo in the Go section, the Rust program also allocates 2mb of heap memory per second: func1 and func2 each allocate 1mb, with func1 calling func2.

Compile the file directly with rustc without any parameters, then execute the following commands to start the program:

$ export MALLOC_CONF="prof:true,lg_prof_interval:25"
$ export LD_PRELOAD=/usr/lib/libjemalloc.so
$ ./demo

MALLOC_CONF specifies jemalloc's runtime options: prof:true turns the profiler on, and lg_prof_interval:25 dumps a profile file for every 2^25 bytes (32mb) of heap memory allocated.

Note: more MALLOC_CONF options can be found in the documentation.

After waiting a while, you can see some profile files being generated.

jemalloc provides a tool similar to tcmalloc's pprof, called jeprof; it was in fact forked from the pprof perl script. We can use jeprof to review the profile files:

$ jeprof ./demo jeprof.7262.0.i0.heap

It can also generate the same graph as Go/gperftools:

$ jeprof --gv ./demo jeprof.7262.0.i0.heap
jeprof svg
Implementation details

Similar to tcmalloc, jemalloc adds sampling logic in malloc():

JEMALLOC_ALWAYS_INLINE int
imalloc_body(static_opts_t *sopts, dynamic_opts_t *dopts, tsd_t *tsd) {
// ...
// If profiling is on, get our profiling context.
if (config_prof && opt_prof) {
bool prof_active = prof_active_get_unlocked();
bool sample_event = te_prof_sample_event_lookahead(tsd, usize);
prof_tctx_t *tctx = prof_alloc_prep(tsd, prof_active,
sample_event);
emap_alloc_ctx_t alloc_ctx;
if (likely((uintptr_t)tctx == (uintptr_t)1U)) {
alloc_ctx.slab = (usize <= SC_SMALL_MAXCLASS);
allocation = imalloc_no_sample(
sopts, dopts, tsd, usize, usize, ind);
} else if ((uintptr_t)tctx > (uintptr_t)1U) {
allocation = imalloc_sample(
sopts, dopts, tsd, usize, ind);
alloc_ctx.slab = false;
} else {
allocation = NULL;
}
if (unlikely(allocation == NULL)) {
prof_alloc_rollback(tsd, tctx);
goto label_oom;
}
prof_malloc(tsd, allocation, size, usize, &alloc_ctx, tctx);
} else {
assert(!opt_prof);
allocation = imalloc_no_sample(sopts, dopts, tsd, size, usize,
ind);
if (unlikely(allocation == NULL)) {
goto label_oom;
}
}
// ...
}
In prof_malloc(), prof_malloc_sample_object() is called to accumulate the corresponding call-stack record in the hashmap:

void
prof_malloc_sample_object(tsd_t *tsd, const void *ptr, size_t size,
size_t usize, prof_tctx_t *tctx) {
// ...
malloc_mutex_lock(tsd_tsdn(tsd), tctx->tdata->lock);
size_t shifted_unbiased_cnt = prof_shifted_unbiased_cnt[szind];
size_t unbiased_bytes = prof_unbiased_sz[szind];
tctx->cnts.curobjs++;
tctx->cnts.curobjs_shifted_unbiased += shifted_unbiased_cnt;
tctx->cnts.curbytes += usize;
tctx->cnts.curbytes_unbiased += unbiased_bytes;
// ...
}
The logic jemalloc injects into free() is also similar to tcmalloc's, and jemalloc likewise relies on libunwind for stack backtracking; we won't repeat that here.

Heap Profiling with bytehound

Bytehound is a Memory Profiler for the Linux platform, written in Rust. Its distinguishing feature is the relatively rich front-end functionality it provides. Our focus is on how it is implemented and whether it can be used in TiKV, so we only briefly introduce its basic usage.

Usage

We can download bytehound's binary and dynamic library from its Releases page; only the Linux platform is supported.

Then, just like with tcmalloc or jemalloc, mount its implementation via LD_PRELOAD. Here we assume we are running the same leaky Rust program from the "Heap Profiling with jemalloc" section:

$ LD_PRELOAD=./libbytehound.so ./demo

A memory-profiling_*.dat file will then be generated in the program's working directory; this is the product of bytehound's Heap Profiling. Note that unlike other Heap Profilers, this file is updated continuously rather than a new file being generated at each fixed interval.

Next, execute the following command to open a web port for analyzing the above file in real time:

$ ./bytehound server memory-profiling_*.dat

Bytehound GUI

The most intuitive way is to click Flamegraph in the upper-right corner to view the flame graph:

Bytehound Flamegraph

From the graph we can easily identify demo::func1 and demo::func2 as the memory hotspots.

Bytehound provides rich GUI functionality, which is one of its highlights; refer to its documentation to explore further.

Implementation details

Bytehound also replaces the user's default malloc implementation. However, bytehound does not implement a memory allocator itself; it wraps jemalloc.

// entry point
#[cfg_attr(not(test), no_mangle)]
pub unsafe extern "C" fn malloc( size: size_t ) -> *mut c_void {
allocate( size, AllocationKind::Malloc )
}
#[inline(always)]
unsafe fn allocate( requested_size: usize, kind: AllocationKind ) -> *mut c_void {
// ...
// call jemalloc to do the actual memory allocation
let pointer = match kind {
AllocationKind::Malloc => {
if opt::get().zero_memory {
calloc_real( effective_size as size_t, 1 )
} else {
malloc_real( effective_size as size_t )
}
},
// ...
};
// ...
// Stack backtracking
let backtrace = unwind::grab( &mut thread );
// ...
// Record the sample
on_allocation( id, allocation, backtrace, thread );
pointer
}
// the xxx_real functions are linked to the jemalloc implementation
#[cfg(feature = "jemalloc")]
extern "C" {
#[link_name = "_rjem_mp_malloc"]
fn malloc_real( size: size_t ) -> *mut c_void;
// ...
}
It appears that stack backtracking and recording happen on every single malloc call, with no sampling logic at all. In the on_allocation hook, the allocation record is sent to a channel and processed asynchronously by a single background processor thread.

pub fn on_allocation(
id: InternalAllocationId,
allocation: InternalAllocation,
backtrace: Backtrace,
thread: StrongThreadHandle
) {
// ...
crate::event::send_event_throttled( move || {
InternalEvent::Alloc {
id,
timestamp,
allocation,
backtrace,
}
});
}
#[inline(always)]
pub(crate) fn send_event_throttled< F: FnOnce() -> InternalEvent >( callback: F ) {
EVENT_CHANNEL.chunked_send_with( 64, callback );
}
And the implementation of EVENT_CHANNEL is simply a Mutex<Vec<T>>:

pub struct Channel< T > {
queue: Mutex< Vec< T > >,
condvar: Condvar
}
Performance overhead

In this section we examine the performance overhead of the Heap Profilers above; the concrete measurement method varies with the scenario.

All tests were run separately on the same physical machine environment.

Go

For Go, we measured by deploying a single-node TiDB + unistore, adjusting the runtime.MemProfileRate parameter, and stress-testing with sysbench.

Compared with "no recording", both in TPS/QPS and in P95 latency, the performance loss of the 512KB sampling record is basically within 1%. The overhead of "full recording" matches the "it will be very high" expectation, but is still surprisingly high: TPS/QPS shrank 20x and P95 latency increased 30x.

Since Heap Profiling is a general-purpose feature, we cannot accurately give a universal performance-loss figure for all scenarios; only measurements under a specific project are meaningful. TiDB is a relatively compute-intensive application, and its memory allocation may not be as frequent as in some memory-intensive applications, so this conclusion (and all following conclusions) can only serve as a reference; readers can measure the overhead under their own application scenarios.
tcmalloc/jemalloc

We measured tcmalloc and jemalloc based on TiKV, by deploying one PD process and one TiKV process on the machine and stress-testing with go-ycsb. The key parameters are as follows:

threadcount=200
recordcount=100000
operationcount=1000000
fieldcount=20
Before starting TiKV, we used LD_PRELOAD to inject the different malloc hooks. tcmalloc used its default configuration, i.e. 512KB sampling similar to Go; jemalloc used the default sampling policy and dumped a profile file for every 1GB of heap memory allocated.

The resulting data: tcmalloc and jemalloc performed almost identically; OPS dropped by about 4% compared with the default memory allocator, and P99 latency rose by about 10%.

We established earlier that tcmalloc's implementation is essentially the same as Go heap pprof's, yet the data measured here are not quite consistent with the Go numbers. The reason is that the memory allocation characteristics of TiKV and TiDB differ, which again confirms what was said before: "We cannot accurately give a universal performance-loss figure for all scenarios; only measurements under a specific project are meaningful."

bytehound

The reason we did not put bytehound together with tcmalloc/jemalloc is that, in practice on TiKV, bytehound runs into a deadlock problem during startup.

Since we speculated that bytehound's performance overhead would be very high and that it theoretically cannot be used in TiKV's production environment, we only needed to confirm this conclusion.

Note: the speculation of high overhead comes from the fact that no sampling logic was found in bytehound's code; the data collected on every allocation is sent through a channel to a background thread for processing, and the channel is simply wrapped around a Mutex + Vec.

We chose the simple mini-redis project to measure bytehound's overhead. Since the goal was only to confirm whether it meets the requirements of TiKV's production environment rather than to get precise measurements, we simply counted and compared TPS. The specific driver code snippet is as follows:

var count int32
for n := 0; n < 128; n++ {
go func() {
for {
key := uuid.New()
err := client.Set(key, key, 0).Err()
if err != nil {
panic(err)
}
err = client.Get(key).Err()
if err != nil {
panic(err)
}
atomic.AddInt32(&count, 1)
}
}()
}
We start 128 goroutines to read and write against the server; one read plus one write counts as one complete operation. Only the count is recorded, with no latency or other metrics, and the final total divided by the execution time gives the TPS before and after enabling bytehound. The result: TPS dropped by more than 50%.
What can BPF bring

Although BPF's performance overhead is low, what BPF can obtain is, to a large extent, only system-level indicators. Heap Profiling in the ordinary sense has to collect statistics on the memory allocation path, but memory allocation tends to be layered.
For example, if our program mallocs a big chunk of heap memory up front as a memory pool, with an allocation algorithm of our own design, and all the heap memory needed by the business logic is then carved out of that pool, the existing Heap Profilers are useless: they would only tell us that a large block of memory was requested during startup, and that the number of memory requests at all other times is 0. In this scenario we have to intrude into our own memory allocation code and do, at its entry point, what a Heap Profiler would do.
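A minimal Go sketch of that idea, with a hypothetical self-designed pool (the names are made up for illustration): the pool's own alloc entry records the stack trace, because allocator-level profilers only ever see the initial slab:

package main

import (
	"fmt"
	"runtime"
)

// pool is a hypothetical self-designed memory pool: one big slab
// requested up front, carved up by our own allocation algorithm.
type pool struct {
	slab    []byte
	off     int
	samples map[[8]uintptr]int // call stack -> bytes handed out
}

func newPool(size int) *pool {
	return &pool{slab: make([]byte, size), samples: map[[8]uintptr]int{}}
}

// alloc is the entry point where we must do the Heap Profiler's job
// ourselves (no bounds or alignment handling; this is just a sketch).
func (p *pool) alloc(n int) []byte {
	var stk [8]uintptr
	runtime.Callers(2, stk[:]) // record who is asking for pool memory
	p.samples[stk] += n

	buf := p.slab[p.off : p.off+n]
	p.off += n
	return buf
}

func main() {
	p := newPool(1 << 20)
	_ = p.alloc(1024)
	for stk, bytes := range p.samples {
		frame, _ := runtime.CallersFrames(stk[:]).Next()
		fmt.Printf("%s took %d bytes from the pool\n", frame.Function, bytes)
	}
}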
BPF has the same problem. We could hang a hook on brk/sbrk: whenever user space actually asks the kernel to grow the heap, record the current stack trace. However, the memory allocator is a complex black box, and the user stack that triggers brk/sbrk most often is not necessarily the user stack causing the memory leak. This would need some experiments to verify; if the results turned out valuable, running BPF long-term as a low-cost fallback would not be out of the question (BPF's permission issues would need extra consideration).
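For illustration only, a sketch of that hook with bpftrace, assuming bpftrace and the syscall tracepoints are available; it counts user stacks by how often they trigger brk, which, as noted, is not necessarily the leaking stack:

$ bpftrace -e 'tracepoint:syscalls:sys_enter_brk { @[ustack] = count(); }'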
As for uprobes, they are merely non-intrusive code instrumentation; for Heap Profiling, the same logic would still have to run inside the allocator, bringing the same overhead — and code intrusion is not a concern for us anyway.

https://github.com/parca-dev/parca does BPF-based Continuous Profiling, but the part that truly exploits BPF is only the CPU Profiler. bcc-tools already contains a Python tool for CPU Profiling (https://github.com/iovisor/bcc/blob/master/tools/profile.py) with the same core principle. For Heap Profiling, it offers little to borrow for now.