
Paging query and join Association query optimization

2022-06-26 18:04:00 Time is light, you are safe

1. Paging query optimization

Suppose we have the following table:

CREATE TABLE `employees` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(24) NOT NULL DEFAULT '' COMMENT 'name',
  `age` int(11) NOT NULL DEFAULT '0' COMMENT 'age',
  `position` varchar(20) NOT NULL DEFAULT '' COMMENT 'position',
  `hire_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'hire time',
  PRIMARY KEY (`id`),
  KEY `idx_name_age_position` (`name`,`age`,`position`) USING BTREE
);

1.1 Scenario

This table currently holds about 100,000 rows. When it comes to paging, most developers would probably write SQL like this:

select * from employees limit 100000,10;

This most conventional approach has a drawback.
With little data, it causes no real problem;
but once the table grows to millions or even tens of millions of rows, and the requested page is far back (dozens or hundreds of pages in), the further back you page, the slower the query gets.


Why is this so?
It looks as though only a few rows are fetched, but this SQL actually reads the first 100,010 records, discards the first 100,000, and then returns the 10 rows you wanted. Querying data deep in a large table is therefore very inefficient.
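The cost model above can be illustrated with a tiny self-contained sketch. Plain Python stands in for the server here; the row count matches the example table, but the function and names are purely illustrative, not MySQL internals:

```python
# Minimal sketch (not real MySQL internals): why LIMIT 100000,10 is slow.
# The server must fetch offset + limit rows, throw away the first
# `offset` of them, and return only the last `limit`.

def limit_query(rows, offset, limit):
    """Simulate `SELECT * ... LIMIT offset, limit`; returns (page, rows_examined)."""
    examined = rows[:offset + limit]      # every row up to offset+limit is read
    return examined[offset:], len(examined)

employees = list(range(1, 102835))        # 102,834 rows, as in the example table
page, examined = limit_query(employees, 100000, 10)
print(len(page), examined)                # 10 rows returned, 100,010 rows examined
```

Only 10 rows reach the client, yet 100,010 rows were examined, which is exactly why deep pages are slow.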

1.2 Optimization techniques

This kind of paging optimization depends on the specific scenario and does not apply everywhere; let's go through it in detail.

1.2.1 Paging queries sorted by an auto-increment, contiguous primary key

Prerequisite: the query pages by primary key, and the primary key is auto-increment and contiguous.
The SQL can then be rewritten into the following form:

select * from employees where id > 100000 limit 10;

With the optimization in place, we can use explain to compare the two statements.
For the original LIMIT query, no index is used; the rows column shows what amounts to a full table scan (the table holds 102,834 rows and 102,872 rows were scanned).

The optimized SQL uses the primary key index, and the number of rows scanned drops sharply; since it walks the B+ tree index, it is far more efficient.

Drawbacks
The limitation of this optimization is obvious: it requires the primary key to be auto-increment and contiguous, which is rare in practice.
Most tables see deletions. If a row in the middle is deleted, the rewritten SQL still returns correct pages before that record, but for pages past it, the result set is wrong.
You can simulate this scenario yourself: delete one row, run both statements again, and compare the results. The two result sets no longer match.

Therefore, if the primary key is not contiguous, the optimization above cannot be used. Likewise, if the original SQL orders by a non-primary-key field, rewriting it this way makes the two statements return inconsistent results. So this rewrite is valid only when both conditions hold:
1. the primary key is auto-increment and contiguous;
2. the result is sorted by the primary key.
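The pitfall with deleted rows can be demonstrated in a short sketch. Python lists stand in for the table here; the ids, offset, and page size are made-up illustration values:

```python
# Hypothetical sketch: with a gap in the primary key, `LIMIT offset, n`
# and the rewritten `WHERE id > offset LIMIT n` return different pages.

ids = [i for i in range(1, 21) if i != 5]            # id 5 was deleted

def page_by_limit(ids, offset, n):
    # SELECT * FROM t ORDER BY id LIMIT offset, n
    return sorted(ids)[offset:offset + n]

def page_by_id(ids, offset, n):
    # SELECT * FROM t WHERE id > offset LIMIT n
    return sorted(i for i in ids if i > offset)[:n]

print(page_by_limit(ids, 10, 3))   # [12, 13, 14]
print(page_by_id(ids, 10, 3))      # [11, 12, 13] -- pages diverge after the gap
```

Before the deleted row, both forms agree; past it, the `id >` rewrite is off by the number of missing ids, which is why contiguity is a hard prerequisite.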

1.2.2 Paging queries sorted by a non-primary-key field

Consider the following SQL:

select * from employees ORDER BY name limit 90000,5;

Analyzing it with explain:
This is again a full table scan, and the sort uses file sorting: filesort.

At this point you may wonder: the table clearly has the idx_name_age_position index, and we sort by the name field, which satisfies the leftmost-prefix rule, so why is the index not used and filesort chosen instead? And how do we optimize this?
The specific reason was covered in an earlier article: scanning the entire index and then fetching the rows for the non-indexed columns (which may mean traversing the index trees repeatedly) costs more than scanning the full table, so the MySQL optimizer abandons the index.
If such a statement is to use an index, it must use a covering index, selecting only the indexed columns. The idea, then, is to sort on as few columns as possible and fetch the remaining columns afterwards: let the sort and paging find only the primary keys first, then look up the full rows by primary key. The SQL is rewritten as follows:

select * from employees e 
inner join (select id from employees order by name limit 90000,5) ed on e.id = ed.id;

Analyzing this with explain, we can see from the results that the covering index is used, and the sort is done via the index as well.

On the face of it, the final SQL runs two queries: one inner query, then a lookup back into the table by id. Will the performance suffer?
No. The inner query, select id from employees order by name limit 90000,5, fully exploits the covering index and still sorts via the index; it produces the final page's ids, already in order. The outer query then performs only a handful of primary-key lookups; a page of results is usually 10 or 20 rows, and fetching that many rows back by primary key is extremely fast.
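The "sort the narrow index columns, then fetch full rows by id" idea can be sketched as follows. This is plain Python standing in for the engine; the dict-of-rows and the generated names are made-up illustration data, with the sorted (name, id) pairs playing the role of the covering index:

```python
# Sketch of the deferred-join pagination idea: sort only (name, id) pairs
# (what the covering index holds), page on that, then fetch full rows by
# primary key.

rows = {i: {"id": i, "name": f"n{i:05d}", "age": 20 + i % 40}
        for i in range(1, 1001)}              # 1,000 made-up rows

def page_deferred(rows, offset, n):
    # Step 1: "covering index" scan -- only (name, id) pairs, cheap to sort
    by_name = sorted((r["name"], r["id"]) for r in rows.values())
    ids = [i for _, i in by_name][offset:offset + n]
    # Step 2: primary-key lookups for just the n rows on the requested page
    return [rows[i] for i in ids]

page = page_deferred(rows, 900, 5)
print([r["id"] for r in page])
```

Only the five rows on the page are fetched in full; everything before them is touched only through the narrow (name, id) pairs, mirroring what the covering index saves.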

2. JOIN association query optimization

Suppose we have the following tables.
t1 holds 10,000 rows:

CREATE TABLE `t1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_a` (`a`)
) ENGINE=InnoDB AUTO_INCREMENT=10201 DEFAULT CHARSET=utf8;

t2 holds 100 rows:

CREATE TABLE `t2` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_a` (`a`)
) ENGINE=InnoDB AUTO_INCREMENT=101 DEFAULT CHARSET=utf8;

2.1 MySQL's two common join algorithms

Nested-Loop Join Algorithm
Block Nested-Loop Join Algorithm

Let's look at each in detail.

2.1.1 Nested-Loop Join (NLJ)

As follows SQL:

select * from t1 inner join t2 on t1.a = t2.a;

Checking the query with explain, we can see that t2 is queried first and t1 second (as noted in an earlier post: a larger id executes first; equal ids execute from top to bottom).

So why is t2 queried first? If you swap t1 and t2 in the statement, which table goes first then?
The result is the same: t2 is still queried first.
MySQL has its own algorithm for this choice. For a join where a join field has an index, MySQL picks the Nested-Loop Join (NLJ) algorithm underneath (this algorithm is not a Cartesian product).

Nested-Loop Join (NLJ): loop over the first table (called the driving table; here, t2) one row at a time, take the join field from each row, use it to fetch the matching rows from the other table (the driven table), and merge the two tables' rows into the result set.

To summarize, the execution plan above tells us:

1. The driving table is t2 and the driven table is t1. The driving table executes first (when execution-plan ids are equal, rows execute from top to bottom). The optimizer generally prefers the smaller table as the driving table, so with inner join the table written first is not necessarily the driving one.

2. The NLJ algorithm is used. In general, if Extra in a join's execution plan does not contain "Using join buffer", the join algorithm is NLJ.

The rough flow of the SQL above:

  1. Read one row from table t2.
  2. Take the join field a from that row and use it to search table t1.
  3. Take the matching rows from t1, merge them with the t2 row, and return the result to the client.
  4. Repeat the steps above until t2 is exhausted.

The whole process reads all of t2 (scanning 100 rows); for each row it takes the value of a and does an index scan of t1 (100 index lookups; scanning the index itself is so fast its cost is almost negligible, and each lookup ends up reading one complete row from t1, so t1 is also scanned for 100 rows). In other words, each row of t2 costs one indexed lookup plus one row fetch in t1 and yields one joined result row. With 100 rows in t2, the total is about 200 row scans (100 per table).
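The 200-scan accounting can be mimicked in a short sketch. Python stands in for the storage engine here, and the index on t1.a is modeled as a dict, which is an assumption of this illustration rather than how InnoDB stores it:

```python
# Sketch of the NLJ flow described above, counting row scans.
# t2 (100 rows) is the driving table; t1's join field `a` is "indexed",
# modeled as a dict keyed on a.

t1 = [{"id": i, "a": i} for i in range(1, 10001)]   # 10,000 rows
t2 = [{"id": i, "a": i} for i in range(1, 101)]     # 100 rows

t1_index_on_a = {row["a"]: row for row in t1}       # stands in for idx_a on t1

scans, result = 0, []
for outer in t2:                         # step 1: read each driving-table row
    scans += 1
    match = t1_index_on_a.get(outer["a"])  # steps 2-3: indexed lookup in t1
    if match is not None:
        scans += 1                       # one indexed row fetched from t1
        result.append((outer, match))

print(scans)                             # 200: 100 t2 rows + 100 t1 fetches
```

Each driving-table row triggers exactly one cheap indexed fetch, which is what keeps NLJ's total near 200 rather than 100 × 10,000.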

If the join field of the driven table has no index, NLJ would perform poorly (explained in detail below), so MySQL chooses the Block Nested-Loop Join algorithm instead.

2.1.2 Block Nested-Loop Join (BNL)

As follows SQL:

select * from t1 inner join t2 on t1.b = t2.b;

This time the join is on a field without an index.
"Using join buffer (Block Nested Loop)" in the Extra column indicates that the join uses the BNL algorithm.
Block Nested-Loop Join (BNL): read the driving table's data (the driving table is the small one; the execution plan shows it is still t2) into join_buffer (think of it as a small in-memory join cache that MySQL allocates), then scan the driven table and compare each of its rows against the data in join_buffer.

The rough flow of the SQL above:

  1. Put all of t2's data into join_buffer.
  2. Scan t1, taking each row and comparing it against the data in join_buffer (the comparison happens in memory; buffered data is released once compared rather than held indefinitely, so there is no risk of excessive memory use).
  3. Return the rows that satisfy the join condition.

The whole process does a full scan of both t1 and t2, so the total rows scanned is 10,000 (all of t1) + 100 (all of t2) = 10,100. And because the data in join_buffer is unordered, each of t1's rows must be checked against all 100 buffered rows, giving 100 × 10,000 = 1,000,000 in-memory comparisons.
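The BNL counts can be verified with the same kind of sketch. Again this is an illustration in plain Python, with a list standing in for MySQL's join_buffer:

```python
# Sketch of the BNL flow: load the driving table (t2) into a join buffer,
# full-scan the driven table (t1), and compare every t1 row against the
# buffer, counting row scans and in-memory comparisons.

t1 = [{"id": i, "b": i} for i in range(1, 10001)]   # 10,000 rows
t2 = [{"id": i, "b": i} for i in range(1, 101)]     # 100 rows, no index on b

join_buffer = list(t2)            # step 1: all of t2 fits in join_buffer

row_scans = len(join_buffer)      # reading t2 into the buffer scanned 100 rows
comparisons, result = 0, []
for row in t1:                    # step 2: full scan of the driven table
    row_scans += 1
    for buffered in join_buffer:  # in-memory comparison against the buffer
        comparisons += 1
        if row["b"] == buffered["b"]:
            result.append((buffered, row))

print(row_scans, comparisons)     # 10100 row scans, 1,000,000 comparisons
```

The expensive part (1,000,000 operations) happens in memory; only 10,100 rows are ever read from the tables, which is the whole point of BNL versus an unindexed NLJ.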

Why does MySQL choose BNL rather than Nested-Loop Join when the driven table's join field has no index?
Without an index, NLJ would need 100 × 10,000 = 1,000,000 row scans, and those would be disk scans.
Clearly BNL performs far fewer disk scans, and compared with disk access, BNL's in-memory comparisons are much faster.
So for joins where the driven table's join field has no index, MySQL generally uses the BNL algorithm; when an index exists it generally chooses NLJ, and in the indexed case NLJ outperforms BNL.

2.2 Optimizing join SQL

Index the join fields, so that MySQL can choose the NLJ algorithm when doing joins.
Let the small table drive the large table. When writing a multi-table join, if you know for certain which table is the small one, you can use straight_join to fix the driving order and save the optimizer its own deliberation time.
straight_join: works like join, but forces the left table to drive the right one; it overrides the optimizer's choice of table order for the join.

MySQL's query optimizer usually rewrites SQL sensibly, but not always accurately.
In the query above, MySQL chose to execute t2 first; in the common case the optimizer gets this right and runs the small table first, but sometimes its choice is wrong.
So if you are certain which of the joined tables is the small one, you can use straight_join to specify which table executes first. The SQL above can be rewritten as:

select * from t2 straight_join t1 on t2.a = t1.a; 

This skips MySQL's internal deliberation over which table holds more data and tells the optimizer directly which table drives the join.
One caveat: calling t2 the small table above assumes the join carries no filter conditions. "Small table drives large table" does not simply mean the table with the smaller total row count should drive; it is the amount of data that actually participates in the join that determines which table is the driving table.

straight_join applies only to inner join, not to left join or right join (those already dictate the execution order of their tables).

That said, let the optimizer judge whenever possible, because most of the time MySQL's optimizer is smarter than we are. Use straight_join with caution: in some cases a hand-picked execution order is not more reliable than the optimizer's.

For more optimizations, see the articles on in vs. exists and count(*) query optimization.


Copyright notice
This article was written by [Time is light, you are safe]; please include the original link when reposting:
https://yzsam.com/2022/177/202206261753537446.html