
Why choose a B+ tree as the storage engine index structure

2022-06-24 11:08:00 jaydenwen123


In the world of databases and storage, the storage engine has always played a central role. Put simply, the storage engine is responsible for how data is read and written, and reading and writing data quickly and efficiently has always been the key problem a storage engine must solve. Most introductions, books, and articles on storage engines simply take it for granted that a read-heavy, disk-based storage engine uses a B+ tree, and very few analyze the thinking and trade-offs behind choosing the B+ tree as the index structure. To answer these questions, this article tries to discuss the topic with you from a new angle:

In a read-heavy, write-light scenario, why do disk-based storage engines choose the B+ tree as the index structure? I hope that after reading this article, readers will have a fresh understanding of the problem and their own answers. My ability is limited; where the wording or understanding is off, corrections are welcome.

The content takes the form of questions and answers, analyzing and solving the problem step by step around the following three questions. Before you start reading, you might as well try to answer them yourself!

[figure]

To reduce confusion, before the article starts let me briefly explain the key terms to readers:

1. Storage engine: "storage engine" is a very broad category. There are page-structured B+ tree storage engines targeting read-heavy workloads, and there are the rising log-structured (LSM tree) storage engines. Unless stated otherwise, the storage engine discussed in this article means a disk-based B+ tree storage engine for read-heavy, write-light scenarios. This kind of engine appears frequently in relational databases; the classic example is InnoDB in MySQL, and boltdb, written in Go, also falls within the scope of this article.

2. Extended content: the sections marked as extended content are optional; readers with a relatively solid background can skip them without making the core content hard to follow, and readers with a more general background can read them selectively.

Now let's try to answer the first two questions, which really belong to one broad class of problem: the first is a data structure selection problem, and the second is about distinguishing cause from effect.

[figure]

1. Background

Where should the answer to this question start? After some thought: only by clarifying the overall background can we know what kind of problem the engineers faced at the time, and then try to explain the choice from its roots. Looking at the environment as a whole, the problem they wanted to solve had the following four main characteristics:

1. Handle read-heavy, write-light workloads

2. Relational databases organize data by rows

3. Store tens of millions of records

4. Storage must be cost-effective

Next, let's explain them one by one, because if these four background assumptions don't hold water, our starting point is wrong and all the answers that follow are nonsense.

1.1 Handle read-heavy, write-light workloads

On this subject, we have to mention that in the early days of the Internet most systems handled read-heavy, write-light scenarios: early BBS content, blogs, portal sites, e-commerce product listings, and so on all fall into this category. Data is written to the database through a limited number of write operations and then browsed and read by users many times. Even today, many user-facing systems remain read-heavy and write-light. So read more, write less is the biggest background an early storage engine faced when reading and writing data.

1.2 Relational databases organize data by rows

Early on, the concept of a storage engine appeared mainly in relational databases; most people encounter the concept through MySQL, where the term comes up often. In a relational database, data is organized as database -> table (multiple columns) -> row. By the time data reaches the storage engine, it has already been organized in row format. Broadly this resembles key-value storage, except that in a relational database the value seen at the storage engine level is a row record flattened according to a specified format. I won't go into the exact format here; you can look it up yourself. The point is the second piece of background: data is stored in row format.

For convenience, in the rest of the article the data stored in the storage engine is discussed uniformly as key-value pairs, where the key can for now be understood intuitively as the primary key.

1.3 Store tens of millions of records

With the rapid development of the Internet, the scale of stored data kept growing and soon reached the level of tens of millions of records. From the demand side this was clearly an iterative process: the magnitude did not arrive overnight with the Internet. But in the world of software, extensibility is a familiar word, so anyone designing a storage engine would naturally plan for it. We therefore take storing tens of millions of records as the third piece of background.

1.4 Storage must be cost-effective

Following the third point, the question of where to store the data naturally arises, and that in turn raises the question of cost. If cost were no object, the problem would be much simpler; but storing tens of millions of records under cost constraints deserves some thought.

Our goal is to find a storage medium that is relatively cheap yet can still hold tens of millions of records.

For a computer, storage media broadly fall into two categories:

1. Volatile storage: typically memory

2. Non-volatile storage: typically the hard disk, especially the early mechanical hard disk

For a detailed comparison of the two, see the figure below:

[figure]

First, from the comparison above we can be fairly certain that the storage medium we want is the hard disk (chiefly the mechanical hard disk), because it matches the characteristics we are looking for well. But while the hard disk meets our requirements, it has inherent structural limitations: accessing the disk is much slower than accessing memory.

This brings us to how a mechanical hard disk is built. A brief introduction follows in the extended content below; read it if you are interested, and readers who already know this material can skip the contents of the dashed box.


Extended content

[figure]

The disk diagram above is taken from this article.

An ordinary mechanical hard disk reads and writes data by first moving the head to the corresponding track, then waiting for the platter to rotate until the corresponding sector passes under the head, and finally reading or writing the data through the head.

In short, one disk read or write consists of three parts: seek time, rotational latency, and transfer time. Of these, seek time is the big one, mainly because moving the head means a motor driving the arm mechanically, which is relatively slow.

For disks, access is certainly slower than memory, but within disk access there are two further modes:

1. Random IO

2. Sequential IO

Sequential IO is much faster than random IO, fundamentally because it reduces how often the head must move, which means less seek time; hence its performance is far better than random IO.

Space is limited, so this is as far as the hard disk digression goes, otherwise it would run away from the topic. With the above knowledge we can proceed with the rest of the article; for more detail on hard disks, consult other materials or click through to the linked article. Now let's continue with the main content.


Second: having chosen the hard disk as the storage medium, we must find a way to overcome its slowness. The figure below shows a detailed comparison of memory access speed and disk access speed.

[figure]

Let's briefly summarize and state the conclusions we have reached so far:

> Conclusion 1: see the extended content above for details.

1. Disk access time = seek time + rotational latency + transfer time:

> Seek time: 8 ms to 12 ms

> Rotational latency: at 7200 rpm, half a revolution takes about 4 ms

> Transfer time: at 50 MB/s, about 0.3 ms

2. Random disk IO ≪ sequential disk IO ≈ random memory IO ≪ sequential memory IO

3. Mechanical disk vs. solid-state disk:

Mechanical drive: electromagnetic storage; reads and writes are controlled through electromagnetic signals, and the head is moved mechanically by an actuator arm. Solid-state disk: semiconductor storage, built from an array of solid-state memory chips (a control unit plus storage units composed internally of flash granules); it is much faster than a mechanical drive.
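The figures above combine into a simple back-of-the-envelope calculation. The sketch below (plain Python arithmetic, using the article's numbers with the seek-time midpoint; all values are illustrative averages, not measurements) estimates the cost of one random disk IO and the resulting IOPS:

```python
# Rough cost of one random disk IO, using the article's figures.
seek_ms = 10.0      # average seek time (8-12 ms above; midpoint assumed)
rotation_ms = 4.0   # half a revolution at 7200 rpm: 60000 / 7200 / 2 ~ 4.17 ms
transfer_ms = 0.3   # transferring a small block at ~50 MB/s

access_ms = seek_ms + rotation_ms + transfer_ms   # ~ 14.3 ms per random IO
iops = 1000.0 / access_ms                         # ~ 70 random IOs per second

print(f"one random IO = {access_ms:.1f} ms, about {iops:.0f} IOPS")
```

Roughly 70 random IOs per second is the entire budget a mechanical disk gives you, which is why the rest of the article works so hard to avoid random IO.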

1.5 Summary

This section explained the four main pieces of background; let's review them:

1. Handle read-heavy, write-light workloads

2. Relational databases organize data by rows

3. Store tens of millions of records

4. Storage must be cost-effective

Combining these with the actual scenario, we chose the hard disk as the medium for storing data, while at the storage engine layer the data is organized for reading and writing at the abstract level of key-value pairs. With the background settled, we now need to know what our goal is. Without a goal there is no direction, so making the goal clear matters.

2. The goal

In the first section we introduced the four pieces of background and concluded that we will store data on the hard disk. In this section we focus on the goal we want to achieve; only with a clear goal can we decompose the task top-down and solve the problem.

Before stating the goal, let's look at what a typical user request looks like when data is stored on disk.

2.1 The flow of a normal user request

We know the data lives on the hard disk, so when a user request (read or write) comes in, it first reaches memory managed by the operating system, where some logic runs; the CPU then issues instructions to read data from the hard disk into memory; finally a response goes back to the user (for a read: the data; for a write: a success acknowledgment, and so on).

That is the general flow. The whole process passes through three stages:

Request path:

user request -> memory -> hard disk

Response path:

user <- memory <- hard disk

Having sorted out the request and response flow, let's look at what our ultimate goal is.

2.2 The goal

Internet applications follow an unwritten rule: be efficient and fast. Storage engines are no exception. Our goal, within the background above, is to serve read and write requests quickly and efficiently.

Here comes the problem: "fast" and "efficient" are qualitative terms. How fast counts as fast? What counts as efficient? How do we analyze the requirement quantitatively?

Pause here and think: if you were responsible for this requirement, from what angle would you approach it?

This shouldn't be hard. Recall the core metric from data structures and algorithms: time complexity, the key indicator of how efficiently a data structure or algorithm does its work. Our data must ultimately be stored, and how it is stored and organized is precisely the question of which data structure to use. The problem therefore becomes: which data structure should organize our data so that reads and writes are as fast and efficient as possible, with "fast" and "efficient" judged by time complexity.

2.3 Summary

This section, together with the previous parts, is summarized in the block diagram below. Which data structure to choose is the subject of the next section.

[figure]

3. Choosing a data structure

As section 2.2 noted, we have turned the problem of fast, efficient reads and writes into choosing which data structure should organize the data, using its time complexity to judge our goal quantitatively. So let's start from data structures.

3.1 Comparing data structure options

Let's analyze in detail. "Fast and efficient" means the average time complexity of reads and writes should be as low as possible. Among the structures we studied, some have O(1) average time complexity; the typical example is the hash table. A hash table serves point lookups and writes, and when collisions are rare it is extremely efficient, but in its native form operations such as range queries and sorting have relatively high complexity, which is its big drawback. Next, slower than O(1), come the structures with O(log n) average complexity: the binary search/sort tree (BST), the balanced binary tree (AVL tree), the red-black tree, the B tree, the B+ tree, the skip list, and so on. These naturally support sorting and range lookups. After O(log n) come O(n), O(n^2), and beyond; a typical structure with O(n) average complexity is the array. We won't expand on the other types here.
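The trade-off can be seen in a few lines of Python (a toy sketch, not a storage engine): a hash map answers a point lookup in O(1) on average, but a range query forces a scan of every key, while a sorted array answers the same range query with two binary searches.

```python
import bisect

data = {17: "a", 3: "b", 42: "c", 8: "d", 25: "e"}

# Point lookup: O(1) on average for the hash map.
assert data[8] == "d"

# Range query [5, 30] against the hash map: no choice but to scan all keys.
scan = sorted(k for k in data if 5 <= k <= 30)

# Same range query against a sorted array: two binary searches and a slice.
keys = sorted(data)                    # [3, 8, 17, 25, 42]
lo = bisect.bisect_left(keys, 5)
hi = bisect.bisect_right(keys, 30)
fast = keys[lo:hi]

assert scan == fast == [8, 17, 25]
```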

The figure below compares them by average time complexity, from fast to slow: O(1) -> O(log n) -> O(n) -> ...

[figure]

We all know that sorting and range queries are very common needs in Internet applications. When shopping on an e-commerce site, most users instinctively click to sort by sales from high to low; on a news portal we look at how many times an article was read in the past week, sorted by time; a recommendation system sorts items by one or more attributes of the content; the examples go on and on. So when choosing the data structure, we must support range lookups, sorting, and similar operations.

On that basis, the hash table does not fit our needs. Stepping back to the O(log n) group: which one do we choose? At a glance they all seem to qualify, and every one of them (BST, AVL tree, red-black tree, B tree, B+ tree, skip list) can be implemented in memory. Intuitively we could pick any of them, but don't forget: our data ultimately lands on the hard disk. Since memory alone gives no verdict, let's look from the disk's perspective: which data structure is easy to maintain on the hard disk?

3.2 Turning to the disk

As introduced earlier, the data flows through three stages; taking a read as an example: disk -> memory -> user. Given that, the intuitive idea is: if the same data structure could be maintained both in memory and on the hard disk, that would be perfect. When a user request came in, data loaded from disk could be placed directly into memory, processed a little, and returned. If the intuitive idea fails (no such data structure can be found), we can only take the indirect route: maintain one data structure on the hard disk to store the data and choose another in memory, converting between the two whenever data is read from disk into memory. The indirect route is the worse one, because the intermediate conversion inevitably costs some performance.

Let's start from the intuitive idea: is there such a data structure?

3.3 Starting from the intuitive idea

First, since our goal is still fast, efficient reads and writes, the corresponding question for the disk is: how do we make disk reads and writes fast and efficient?

From the earlier introduction, everyone knows the answer: use sequential disk IO as much as possible. Speaking of sequential IO, the first thing that comes to mind is append-only writing, a typical form of sequential writing and sequential IO with very high performance. Suppose, then, that every incoming write request is saved by appending. Writes are now fast and efficient. But how do we read?

As introduced earlier, the data is flattened into key-value form and each record has a different length. To guarantee reads, we must save some extra information to help serve read requests; this extra data is what we call the index. Think about it: in this append-only scheme, what information do we need to serve a normal read request? In fact, for each record we only need to know where on the disk it was written (its offset) and how many bytes it occupies (its size); with these two pieces of information we can read it back. In short, each record corresponds to one such pair of index information. A simple diagram follows:

[figure]
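A minimal sketch of this append-only scheme (hypothetical class and method names; an in-memory bytearray stands in for the disk file): every write appends the record bytes and stores an (offset, size) index entry for its key.

```python
class AppendLog:
    """Append-only store: one (offset, size) index entry per record."""

    def __init__(self):
        self.disk = bytearray()   # stand-in for the on-disk file
        self.index = {}           # key -> (offset, size)

    def put(self, key, value: bytes):
        offset = len(self.disk)   # always write at the end: sequential IO
        self.disk += value
        self.index[key] = (offset, len(value))

    def get(self, key) -> bytes:
        offset, size = self.index[key]          # locate via the index entry
        return bytes(self.disk[offset:offset + size])

log = AppendLog()
log.put("k1", b"hello")
log.put("k2", b"storage engine")
assert log.get("k1") == b"hello"
assert log.get("k2") == b"storage engine"
```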

At this point efficient writes are settled, and with the index maintained, reading a single record is settled too. But one problem remains to figure out: the sorting and range-lookup operations mentioned earlier.

In this scheme every record is simply appended, so the data itself is stored on disk out of order. To sort or do a range lookup we would have to load every record on disk into memory, iterate over them one by one, filter out the ones that match, and return them to the user. That works, but it is obviously far too inefficient, and the data on disk may far exceed the memory size, so this approach is a non-starter. Is there a way out?

Let's make a small assumption: suppose that while writing to disk sequentially we could also guarantee the written data is ordered. Say we write three records and the three are written in key order; then for a range lookup we only need to find the first matching record, because everything after it is ordered and can be read sequentially very quickly. If the assumption holds, sorting and range lookups become fundamentally simpler and we can skip all that trouble. Proceeding under this assumption, what else do we need to solve?

In this mode, accessing each record still requires the earlier conclusion: each record maintains one index entry of (offset, size).

[figure]

We want to store tens of millions of records, and every record needs an index entry, so ten million records mean maintaining ten million index entries. The questions pile up: how do we organize these ten million index entries? Which data structure do we choose? Where do we save them?

Facing tens of millions of index entries, the intuitive ideas fall into two categories:

  1. Can we reduce the number of index entries? With fewer index entries, the problem naturally becomes easier.
  2. If the number cannot be reduced, can we find a reasonable data structure to organize them? Here "reasonable" can mean space compression, optimization, and so on.

Let's start with the first idea!

Q: Why are tens of millions of index entries generated?

A: Because every record must maintain its own index entry, and we need to keep tens of millions of records, so we must store tens of millions of index entries.

Q: Why does each record need its own index entry?

A: Because every record comes in from a user request, and when a record is flattened into row format its length is not fixed; therefore each record needs its own index entry.

Now we know the core cause of the problem.

Let's summarize the derivation so far with a diagram:

[figure]

3.4 The core contradiction of the index

The core contradiction of the index: by the analysis above, each record is variable-length, so each record needs its own index entry. Variable length, variable length, variable length... can we improve or optimize along this dimension? We cannot control the length of each record, but can we transform the disk instead?

Suppose we divide the disk into fixed-size contiguous blocks and give each block a block number no to tell them apart. If our block size is 100 bytes, then block 0 covers bytes 0~99, block 1 covers 100~199, and so on. What changes? Let's analyze in detail.

After dividing the disk into fixed-size contiguous blocks, each block still retains the two properties of the original scheme: the data is ordered and written sequentially. The general result looks like the figure below:

[figure]

After this transformation, the key question is how reads and writes are guaranteed.

Assume for now that a block is large enough, so we can sidestep the problem of a record not fitting in one block.

Normally one block holds multiple records, stored in order within the block. To read a record, which must live in one of the blocks, we first solve the location problem: which block? Once the block is located, we load it from disk into memory; since the data inside the block is ordered, binary search finds the index entry of the record we want, and the index entry leads to the data. Writing works similarly: logically we write one record, but internally the disk is written block by block.

The problem then becomes: how do we determine which block a record is stored in?

For this we only need to pin down the range of records each block stores. For example, block 0 holds records with ids 0~10; block 1 holds ids 11~23; and so on.

Then, when querying the record with id 7, we immediately know it lives in block 0, search inside block 0 for the index entry of the record with id 7, and finally read the data.

How do we pin the range down? Alongside each block number no, save two more pieces of information: the minimum record key stored in the block, min, and the maximum, max.

Each block then corresponds to a triple: block -> (no, min, max).

The triple means: block no stores the records in the range min to max.

Thinking further, the triple still has room for improvement. Because each block is written sequentially with ordered data, and the blocks themselves are ordered, the range of records stored by block i runs from block i's minimum up to block i+1's minimum; the fundamental reason is that the blocks are ordered. So we can simplify the triple to a pair, block -> (no, min), while still guaranteeing that the data stored across blocks is logically ordered.
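The two-step lookup that follows from this design, first picking the block from the sorted (no, min) pairs and then binary-searching inside the block, can be sketched like so (illustrative data; Python's bisect does both searches):

```python
import bisect

# Block index: (no, min) pairs sorted by min. Block i covers keys from
# min_i up to (but not including) min_{i+1}; the last block covers the rest.
block_index = [(0, 0), (1, 11), (2, 24)]
blocks = {            # block no -> ordered keys stored in that block
    0: [0, 3, 7, 10],
    1: [11, 15, 23],
    2: [24, 30, 41],
}

def find_block(key):
    """Return the block number holding key, or None if key is absent."""
    mins = [m for _, m in block_index]
    i = bisect.bisect_right(mins, key) - 1   # block whose range covers key
    if i < 0:
        return None
    no = block_index[i][0]
    rows = blocks[no]                        # one block read from "disk"
    j = bisect.bisect_left(rows, key)        # binary search within the block
    return no if j < len(rows) and rows[j] == key else None

assert find_block(7) == 0      # the article's example: id 7 lives in block 0
assert find_block(23) == 1
assert find_block(12) is None  # covered by block 1's range, but not stored
```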

That was a lot; let's summarize:

  1. We introduced the concept of dividing the disk into fixed-size contiguous blocks.
  2. Within a block, data is still ordered and written sequentially; the block still holds two pieces of information per record: the offset where the record was written on disk and the size it occupies.
  3. Across blocks, data is ordered; each block keeps two pieces of information: its block number no and the minimum record key it holds, min.

With the concept of blocks introduced, range lookups, sorting, and similar operations can in most cases be served with fewer IOs, because each read fetches a whole block and a block contains multiple records. If all the data involved lives in one block, multiple records cost only a single disk read. So performance improves here as well.

The overall structure is shown below:

[figure]

Meanwhile, two questions remain to be solved:

1. How big should a block be?

2. Should the block index be persisted? If so, how?

For the first question, how big should a block be, we can look at both sides.

If the block is larger, each block holds more data, so at the same data volume we need fewer blocks and hence fewer block index entries to maintain; but reading or writing a single record touches more extra bytes (IO happens in whole blocks), so per-record read/write efficiency is lower.

If the block is smaller, each block holds less data, so at the same data volume we need more blocks and hence more block index entries to maintain; but reading or writing a single record touches fewer extra bytes, so per-record read/write efficiency is higher.

So block size is ultimately a trade-off. How do we choose?

Don't forget: our data is stored on disk, and our application runs on top of the operating system. What we are really asking is how to make the operating system serve our application best, in short, to exploit the operating system's characteristics for all they are worth. Every data read or write involves disk operations (reads, writes, flushes), and before data reaches the disk it is first written to memory. The operating system manages memory in units of pages, and the way it reads, writes, and flushes the disk is inseparable from how it manages those pages. So, to cooperate with the operating system, we choose the block size in units of the OS page: the simplest choice is one block equal to one page (4 KB by default); larger choices are 8 KB, 16 KB, 32 KB, and so on.

For now, just remember: InnoDB's default page size is 16 KB, while boltdb's default page size is the operating system page size, 4 KB.

Since we chose the OS page as the basic unit of block size, there is no need for the new term "block" at all; we will simply call blocks pages, reducing the cost of new terminology.
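If you want to see the OS page size the article is talking about, Python exposes it directly; the result is platform-dependent (commonly 4096 bytes on Linux, larger on some systems):

```python
import mmap

page = mmap.PAGESIZE     # the operating system's memory page size in bytes
print(page)              # commonly 4096; platform-dependent

# Page sizes are always powers of two.
assert page > 0 and page & (page - 1) == 0
```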

That settles the first question. Now the second:

Should the block index be persisted? If so, how?

Our answer is: persist it. If it were not persisted, then every time our application restarted the block index would have to be rebuilt; with a large data volume that rebuild is quite time-consuming, and during it the service may be unavailable. So we store the block index. How?

Option 1: the simplest way is to set aside dedicated blocks to hold the block index.

 In MySQL this arrangement is called a **clustered index**: index and record data are stored together, in one file.

Option 2: save the block index in a separate file.

 In MySQL this arrangement is called a **non-clustered index**: index and record data are stored separately, in different files.

3.5 B tree or B+ tree?

At this point our derivation is nearly complete. Summarizing it all, the final result is shown in the figure below.

[figure]

Each dashed box in the figure represents a page. Every page stores data (record entries or index entries), and pages can hold pointers to one another so that the pages stay logically ordered. Each page contains multiple record or index entries, stored in order. If we take away the black lines, what structure remains?

The answer: a multiway tree. Each page can be seen as a node holding several entries, and a node can have multiple children.

Now one more question. With this structure we can already read and write; consider a read. If the key being read happens to be the minimum record stored on some page, then if that minimum entry also carries the record's data we can return immediately, without descending further; if only the minimum key is stored, we must keep walking down. So let's examine: what difference does it make whether the index entries in each page carry the original record data or not?

[figure]

From the analysis above: if the index entries in a page carry the original data, the structure corresponds to the B tree; if they do not, it corresponds to the B+ tree. Having clarified the difference, which do we choose?

The answer: do not carry the data. For a page of the same size, if the page's index entries also store the original data, the number of index entries one page can hold inevitably drops, which in turn means that for the same volume of data the tree grows much taller than it otherwise would. In our scenario the height of the tree corresponds directly to the number of disk IOs, and since we want fast and efficient, we should cut every unnecessary disk IO. So: do not carry the data. Step by step, we conclude that the data structure we choose is the B+ tree.

Thus, starting from the intuitive idea, we have found a data structure that is easy to maintain on disk: the B+ tree.

In the course of abstracting and refining it, we introduced the concept of the page: data on disk is organized in units of pages, data within a page is stored in order, and data across pages is stored in order as well.

In the on-disk B+ tree, non-leaf nodes store page index information, namely (page number no, the minimum record key min stored on that page); leaf nodes store the actual record data.

With the on-disk B+ tree settled, the in-memory structure is naturally a B+ tree as well. Here is how the two map onto each other.

A node of the in-memory B+ tree corresponds to a page on disk, and the items inside an in-memory node correspond to the elements on that disk page (each record's data, or each page index entry).

One more point: the core reason we index with a B+ tree rather than a B tree is that, for the same amount of stored data, a B+ tree index incurs fewer disk IOs on average than a B tree index. In addition, in a B+ tree the time complexity of different requests is relatively uniform, because every record's data lives in a leaf node.
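The height argument can be made concrete with the numbers used later in the article (16 KB pages, 16-byte index entries, 160-byte records). If interior entries also carried full records, B-tree style, the fan-out would collapse; the sketch below (a rough estimate, not an exact occupancy model) compares the interior levels needed to index 10^8 records under both layouts:

```python
import math

PAGE = 16 * 1024        # InnoDB-style 16 KB page
RECORDS = 10**8

# B+ tree: interior pages hold only 16-byte index entries.
bplus_fanout = PAGE // 16            # 1024 children per interior page
# B tree: each interior entry also drags along a 160-byte record.
btree_fanout = PAGE // (16 + 160)    # 93 children per interior page

def interior_levels(fanout):
    # smallest number of levels with fanout**levels >= RECORDS
    return math.ceil(math.log(RECORDS, fanout))

print(interior_levels(bplus_fanout))  # 3
print(interior_levels(btree_fanout))  # 5
```

The taller tree means more page reads per lookup, which on disk means more IOs per request.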

3.6 Summary

Here our attempt to answer why the B+ tree was chosen as the storage engine index structure comes to an end. Let's summarize with a picture:

[figure]

Finally, let's compare the B+ tree as a textbook data structure with our disk-plus-memory B+ tree.

The figure below shows the B+ tree as a data structure; we won't re-explain its properties here.

[figure]

And below is the corresponding disk-plus-memory B+ tree diagram:

[figure]

Finally, in the next section we try to answer the third question, to validate the scheme we have chosen.

4. A reverse argument

Having chosen the B+ tree as the index structure for storing tens of millions of records: for a storage engine with a given page size, how many records can a 3-level or 4-level B+ tree store?

Through this question we verify once more whether the chosen scheme solves the problem raised at the start: storing tens of millions of records.

4.1 How much data can a 3-level or 4-level B+ tree store (taking InnoDB as an example)?

[figure]

How can we make a rough estimate?

We know InnoDB's default page size is 16 KB, so we use 16 KB for the estimate here as well.

Before estimating, we assume two numbers:

  1. Assume each page index entry stored in a non-leaf node occupies 16 bytes; a 16 KB page can then hold roughly 1000 (16 KB / 16 B) index entries.
  2. Assume each record stored in a leaf node averages 160 bytes; a 16 KB page can then hold roughly 100 (16 KB / 160 B) records.

Putting it together:

A 3-level B+ tree can store about 1000 * 1000 * 100 records, roughly 10^8.

A 4-level B+ tree can store about 1000 * 1000 * 1000 * 100 records, roughly 10^11.
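The estimate is just multiplication and easy to check (using the article's assumed fan-outs of roughly 1000 index entries per interior page and 100 records per leaf page):

```python
index_fanout = 1000   # ~16 KB page / 16-byte index entry
leaf_records = 100    # ~16 KB page / 160-byte record

# A 3-level tree has 2 interior levels above the leaf level;
# a 4-level tree has 3.
three_level = index_fanout**2 * leaf_records
four_level = index_fanout**3 * leaf_records

print(three_level)  # 100_000_000, i.e. 10**8
print(four_level)   # 100_000_000_000, i.e. 10**11
```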

In other words, a 3-level B+ tree in this scenario can store on the order of 10^8 records, comfortably covering the tens of millions we set out to store, so the scheme indeed solves the problem raised at the start. A brief summary follows.

[figure]

4.2 Summary

With that, we have answered all three questions, and through them explored some of the selection process behind the B+ tree storage engine for read-heavy, write-light scenarios. Note that this article is only some personal thinking and pondering from my own learning; please point out anything misunderstood or inaccurately expressed.

Original site

Copyright notice
This article was created by [jaydenwen123]; please include a link to the original when reposting. Thanks.
https://yzsam.com/2021/06/20210607002236387Y.html