SQL statement parsing is an important and complex technology. Database-traffic features such as SQL audit, read/write splitting, and sharding all depend on SQL parsing. For Pisa-Proxy, a practical implementation of the Database Mesh idea whose core job is managing database traffic, implementing SQL parsing is therefore essential. Taking Pisa-Proxy as an example, this article shows how SQL parsing is implemented in Pisa-Proxy, the problems we encountered, and the optimizations we made.

1. Background

About parsing

Parsing is usually split into two stages: a lexer (generated by a tool such as Flex) turns the input into tokens, and a parser consumes those tokens to decide whether they satisfy the defined grammar rules.

Parsers, in turn, are usually produced by parser generators.

The commonly used parsing algorithms are:

  • LL (top-down)

For context-free grammars: the input is scanned from left to right and the syntax tree is derived from the leftmost symbol. Compared with LR it is easier to understand, and its error handling is friendlier.

  • LR (bottom-up)

For context-free grammars: the input is scanned from left to right and the syntax tree is built bottom-up, producing a rightmost derivation in reverse. Compared with LL it is faster.

  • LALR

Similar to LR, but it generates far fewer parser states, and therefore much smaller tables. The bison/yacc parser generators widely used in industry produce LALR parsers.

About the investigation

At the start of developing SQL parsing, we investigated the antlr_rust, sqlparser-rs, and nom-sql projects along four dimensions: performance, maintainability, development efficiency, and completeness. Each had problems.

ShardingSphere parses the various SQL dialects with Antlr. Hoping to reuse its grammar files, we looked at the antlr_rust project, but it is not active or mature enough.

In the Rust community, sqlparser-rs is a relatively mature library compatible with many SQL dialects, and Pisa-Proxy will also support multiple data sources in the future. However, both its lexer and parser are hand-written, which would make it difficult for us to maintain.

nom-sql is a SQL parser built on the nom library, but it is incomplete, and its performance in our tests fell short of expectations.

Grmtools is a library we discovered while looking for a Rust implementation of Yacc. It is compatible with most Yacc features, so we could reuse the official MySQL grammar file, although the lexer would have to be written by hand. After weighing development efficiency against completeness, we decided to do the difficult but right thing: implement our own SQL parser. We quickly built a demo to test the idea.

After coding, the test results were good.

To summarize, we compared antlr_rust, sqlparser-rs, nom-sql, and grmtools on four criteria: completeness, performance, maintainability, and development efficiency.

In the end, we chose grmtools to build the SQL parser in Pisa-Proxy.

2. Using Grmtools

Using the grmtools parsing library involves roughly three steps. Let's take the implementation of a calculator as an example.

  1. Write the Lex and Yacc files

Lex: create calc.l with the following contents:

%%
[0-9]+ "INT"
\+ "+"
\* "*"
\( "("
\) ")"
[\t ]+ ;

Grammar: create calc.y with the following contents:

%start Expr
%avoid_insert "INT"
%%
Expr -> Result<u64, ()>:
      Expr '+' Term { Ok($1? + $3?) }
    | Term { $1 }
    ;

Term -> Result<u64, ()>:
      Term '*' Factor { Ok($1? * $3?) }
    | Factor { $1 }
    ;

Factor -> Result<u64, ()>:
      '(' Expr ')' { $2 }
    | 'INT'
      {
          let v = $1.map_err(|_| ())?;
          parse_int($lexer.span_str(v.span()))
      }
    ;
%%
// Code here is in scope for all the grammar actions above.
fn parse_int(s: &str) -> Result<u64, ()> {
    s.parse::<u64>().map_err(|_| ())
}
  2. Generate the lexer and parser at compile time

Grmtools generates the lexer and parser at compile time, so you need to create a build.rs with the following contents:

use cfgrammar::yacc::YaccKind;
use lrlex::CTLexerBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    CTLexerBuilder::new()
        .lrpar_config(|ctp| {
            ctp.yacckind(YaccKind::Grmtools)
                .grammar_in_src_dir("calc.y")
                .unwrap()
        })
        .lexer_in_src_dir("calc.l")?
        .build()?;
    Ok(())
}
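For build.rs to compile, the grmtools crates must be declared in Cargo.toml, both as build-dependencies (used by build.rs) and as regular dependencies (used at runtime). The version numbers below are illustrative; use the latest released versions:

```toml
# Crates used by build.rs at compile time.
[build-dependencies]
cfgrammar = "0.13"
lrlex = "0.13"
lrpar = "0.13"

# Crates used by the generated lexer/parser at runtime.
[dependencies]
cfgrammar = "0.13"
lrlex = "0.13"
lrpar = "0.13"
```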
  3. Use the parser in the application
use std::env;

use lrlex::lrlex_mod;
use lrpar::lrpar_mod;

// Using `lrlex_mod!` brings the lexer for `calc.l` into scope. By default the
// module name will be `calc_l` (i.e. the file name, minus any extensions,
// with a suffix of `_l`).
lrlex_mod!("calc.l");
// Using `lrpar_mod!` brings the parser for `calc.y` into scope. By default the
// module name will be `calc_y` (i.e. the file name, minus any extensions,
// with a suffix of `_y`).
lrpar_mod!("calc.y");

fn main() {
    // Get the `LexerDef` for the `calc` language.
    let lexerdef = calc_l::lexerdef();
    let args: Vec<String> = env::args().collect();
    // Now we create a lexer with the `lexer` method with which we can lex an
    // input.
    let lexer = lexerdef.lexer(&args[1]);
    // Pass the lexer to the parser and lex and parse the input.
    let (res, errs) = calc_y::parse(&lexer);
    for e in errs {
        println!("{}", e.pp(&lexer, &calc_y::token_epp));
    }
    match res {
        Some(r) => println!("Result: {:?}", r),
        _ => eprintln!("Unable to evaluate expression.")
    }
}

See: the grmtools documentation.

As mentioned above, we had to write the lexer by hand: the stock grmtools lexer matches tokens with regular expressions, which is not expressive enough for flexible and complex SQL statements. To plug a custom lexer into grmtools, you implement the following trait:

lrpar::NonStreamingLexer

grmtools also provides a convenient type for instantiating one:

lrlex::LRNonStreamingLexer::new()
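To make the shape of a hand-written lexer concrete, here is a minimal, self-contained sketch. It does not use the real lrpar/lrlex types, and the token set is illustrative, far smaller than what SQL needs:

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Ident(String),
    Number(u64),
    Symbol(char),
}

// Minimal hand-rolled tokenizer: walk the input bytes and group them into tokens.
fn tokenize(input: &str) -> Vec<Token> {
    let bytes = input.as_bytes();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b' ' | b'\t' | b'\n' => i += 1, // skip whitespace
            b'0'..=b'9' => {
                let start = i;
                while i < bytes.len() && bytes[i].is_ascii_digit() {
                    i += 1;
                }
                tokens.push(Token::Number(input[start..i].parse().unwrap()));
            }
            b'a'..=b'z' | b'A'..=b'Z' | b'_' => {
                let start = i;
                while i < bytes.len()
                    && (bytes[i].is_ascii_alphanumeric() || bytes[i] == b'_')
                {
                    i += 1;
                }
                tokens.push(Token::Ident(input[start..i].to_string()));
            }
            other => {
                tokens.push(Token::Symbol(other as char));
                i += 1;
            }
        }
    }
    tokens
}

fn main() {
    println!("{:?}", tokenize("select id from t where id = 1"));
}
```

A real implementation would additionally record each lexeme's span in the input, which is what `lrpar::NonStreamingLexer` requires so that grammar actions can retrieve the matched text.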

3. Problems Encountered

Building on the above, we wrote the SQL lexer by hand and reused the official MySQL sql_yacc grammar file. During development, we ran into the following problems.

  1. Shift/Reduce errors

Shift/Reduce conflicts:
State 619: Shift("TEXT_STRING") / Reduce(literal: "text_literal")

This is a common class of error with LALR parsers, and it is generally resolved by analyzing the rules involved. Take the classic if-else statement with the following rules:

%nonassoc LOWER_THEN_ELSE
%nonassoc ELSE
stmt:
      IF expr stmt %prec LOWER_THEN_ELSE
    | IF expr stmt ELSE stmt
    ;

When an ELSE is read from the input, there are two possibilities:

1) Shift, continuing with the second rule

2) Reduce by the first rule

This is the classic Shift/Reduce conflict.

Back to our problem, which involves the following rules:

literal -> String:
      text_literal
      { }
    | NUM_literal
      { }
    ...

text_literal -> String:
      'TEXT_STRING' {}
    | 'NCHAR_STRING' {}
    | text_literal 'TEXT_STRING' {}
    ...

Analysis:

    Stack    Input token    Action
             test           Shift test
    test     $              Reduce: text_literal / Shift: TEXT_STRING

Solution:

The conflict is resolved by setting precedence: give text_literal a lower precedence than TEXT_STRING, as below:

%nonassoc 'LOWER_THEN_TEXT_STRING'
%nonassoc 'TEXT_STRING'

literal -> String:
      text_literal %prec 'LOWER_THEN_TEXT_STRING'
      { }
    | NUM_literal
      { }
    ...

text_literal -> String:
      'TEXT_STRING' {}
    | 'NCHAR_STRING' {}
    | text_literal 'TEXT_STRING' {}
    ...
  2. SQL containing Chinese characters

In the lexer, we originally indexed the input via .chars(). For strings containing multi-byte characters such as Chinese, the number of chars differs from the number of bytes, so the computed spans were wrong and parsing failed. We switched to the .as_bytes() method instead.
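The mismatch is easy to reproduce: for multi-byte UTF-8 text, the char count and the byte count diverge, so char indices cannot be used as byte spans.

```rust
fn main() {
    let s = "select '中文'";
    // 11 characters, but 15 bytes: each Chinese character is 3 bytes
    // in UTF-8, so spans computed from char indices point at the
    // wrong bytes of the underlying string.
    println!("{} chars, {} bytes", s.chars().count(), s.as_bytes().len());
}
```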

4. Optimization

  1. Dry-run parsing (see the appendix for the test code): without executing actions, the performance was as follows:

$ time ./parser

real        0m4.788s
user        0m4.781s
sys         0m0.002s

Looking for ways to optimize, we generated a flame graph.

The flame graph showed that most of the CPU time was spent on serialization and deserialization. Locating the code confirmed that the table data was deserialized on every parse. Since __GRM_DATA and __STABLE_DATA are fixed after compilation, the grm and stable values can be built once and passed in as function parameters instead.
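The idea can be sketched with a self-contained stand-in (GRM_DATA and the byte payload here are placeholders, not the real generated tables): do the expensive deserialization once and let every parse call reuse the result.

```rust
use std::sync::OnceLock;

// Placeholder for the serialized table data baked in at compile time.
static GRM_DATA: &[u8] = &[1, 2, 3, 4];

// Deserialized tables, built once on first access instead of per parse.
static TABLES: OnceLock<Vec<u8>> = OnceLock::new();

fn tables() -> &'static [u8] {
    TABLES.get_or_init(|| {
        // Stand-in for the expensive deserialization step; it runs once.
        GRM_DATA.to_vec()
    })
}

// Hot path: reuses the cached tables, no deserialization per call.
fn parse(input: &str) -> usize {
    input.len() + tables().len()
}

fn main() {
    println!("{}", parse("select 1")); // 8 input bytes + 4 table bytes
}
```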

  2. Profiling again, we found that every parse initializes an actions array. As the grammar gains rules, this array grows with it, and its elements are dyn trait references, which carry runtime overhead.

Looking at the generated code, the actions array follows a regular pattern, as below:

::std::vec![&__gt_wrapper_0,
&__gt_wrapper_1,
&__gt_wrapper_2,
...
]

So we can construct the dispatch manually; here is the pseudo-code:

match idx {
    0 => __gt_wrapper_0(),
    1 => __gt_wrapper_1(),
    2 => __gt_wrapper_2(),
    ....
}

Checking the assembly on godbolt, the difference is still significant: this removes the cost of building the array and also greatly reduces memory usage.

See :https://rust.godbolt.org/z/zTjW479f6
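As an illustration of the difference (these wrapper functions are stand-ins, not the grmtools-generated ones): the Vec version allocates and calls through trait-object references, while the match version needs no allocation and lets the compiler emit a jump table.

```rust
fn act0() -> u32 { 0 }
fn act1() -> u32 { 10 }
fn act2() -> u32 { 20 }

// Dynamic dispatch: builds a fresh Vec on every call and invokes the
// action through a trait-object reference.
fn dispatch_vec(idx: usize) -> u32 {
    let actions: Vec<&dyn Fn() -> u32> = vec![&act0, &act1, &act2];
    actions[idx]()
}

// Static dispatch: no allocation; the compiler can lower the match
// into a jump table.
fn dispatch_match(idx: usize) -> u32 {
    match idx {
        0 => act0(),
        1 => act1(),
        2 => act2(),
        _ => unreachable!(),
    }
}

fn main() {
    assert_eq!(dispatch_vec(1), dispatch_match(1));
    println!("{}", dispatch_match(2)); // prints 20
}
```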

However, as the actions array keeps growing, the match produces many je/jmp instructions. Whether this hurts CPU branch prediction, and whether it could be mitigated with likely/unlikely-style hints, we have not yet tested.

Final flame graph comparison

Final test results:

$ time ./parser

real        0m2.677s
user        0m2.667s
sys         0m0.007s

5. Summary

This article is the first in a series on SQL parsing in Pisa-Proxy, telling the story behind developing the SQL parser. In future posts we will cover in detail how to write Yacc grammar rules and the components and utilities in grmtools. Stay tuned.

Appendix

Pisa-Proxy's SQL parsing code:

pisanix/pisa-proxy/parser/mysql at master · database-mesh/pisanix

Test code

    let input = "select id, name from t where id = ?;";
    let p = parser::Parser::new();
    for _ in 0..1_000_000 {
        let _ = p.parse(input);
    }

Pisanix

Project address: https://github.com/database-mesh/pisanix

Official website: https://www.pisanix.io/

Database Mesh: https://www.database-mesh.io/

SphereEx official website: https://www.sphere-ex.com
