
ZUCC_Programming Language Principles and Compilation_Experiment 05: Regular Expressions, Finite Automata, Lexical Analysis

2022-06-24 08:29:00 The stars don't want to roll

Programming Language Principles and Compilation: Experiment Report

Course name: Programming Language Principles and Compilation
Experiment topic: Regular expressions, finite automata, lexical analysis

Experimental content

One 、 Read the courseware and Chapter 2 of the textbook

  1. Understand the concept of tokens (lexical elements), the kinds of tokens, and token semantic values

    • A token is the smallest grammatical unit with independent meaning
    • Identifiers (variable names, function names, class names, ...); keywords (if, while, for, function, static); constants (Int, Float, Complex, Boolean, Symbol, ...); operators (+ - * / ^ *= :=); delimiters ([ ] { } ( ) ; ,); strings ("asf")
    • For an identifier the semantic value is generally its index in the symbol table; for keywords, operators, and delimiters it may be empty; for constants and strings it is the value itself or an index into the symbol table
  2. Understand the basic elements of regular expressions and the ways of combining them (choice, concatenation, repetition); a concrete F# sketch follows this list

    • A regular expression denotes a set of strings conforming to a particular pattern; it is used to describe the structure of tokens
    • If r and s are regular expressions denoting the languages L(r) and L(s), then:
      • r|s is a regular expression denoting the language L(r) ∪ L(s)
      • rs is a regular expression denoting the language L(r) • L(s)
      • r* is a regular expression denoting the language (L(r))*
  3. Understand the longest-match rule and operator precedence

    • When several prefixes of the input string each lead to an accepting state, the longest matching prefix is always chosen
    • Precedence: '*' > concatenation ('.') > '|'; concatenation and alternation are both left-associative
  4. Understand Examples 2-1 and 2-2 on p. 13

  5. Understand NFAs, DFAs, and the DFA minimization method

    1. Deterministic finite automaton (DFA): for a given input symbol there is exactly one possible action (no ε-transitions)
    2. Nondeterministic finite automaton (NFA): for the same input symbol there may be several possible actions
    3. Minimization: treat a set of states as a single state; if all states in a subset make equivalent transitions (into the same subsets) on every input symbol, those states can be merged
  6. Enter regular expressions into Automata.js and examine the results
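To make the combination methods in item 2 concrete, here is a minimal F# sketch of regular expressions as a small AST; the type and constructor names are my own illustration, not from the courseware:

    // Regular expressions as a small AST: one case per combination method.
    type Regex =
        | Empty                    // ε: matches only the empty string
        | Sym  of char             // a single input symbol
        | Alt  of Regex * Regex    // r|s : choice,        L(r) ∪ L(s)
        | Seq  of Regex * Regex    // rs  : concatenation, L(r) • L(s)
        | Star of Regex            // r*  : repetition,    (L(r))*

    // Example: the pattern (ab|ac)* from Problem Four below.
    let abac = Star (Alt (Seq (Sym 'a', Sym 'b'), Seq (Sym 'a', Sym 'c')))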

Two 、 Write the following regular expressions

1、 Tabs \t and spaces, e.g. the input "\t \t" (note that \s already includes \t, so the pattern below is equivalent to ^\s*$)

^(\t|\s)*$

2、 C/C++ comments // and /* ... */, e.g. /* a/**/sfasfd */ (based on the network resource below)

/\*[^*]*\*+([^/*][^*]*\*+)*/

https://blog.csdn.net/zhangpeterx/article/details/88115338

3、 String constants, e.g. "this is a string"

^"[a-zA-Z\s]+"$


4、 Floating-point numbers, e.g. 3.14159

^(-?\d+)(\.\d+)?$

https://blog.csdn.net/weixin_39867296/article/details/111611301


Also studied: an example regular expression for matching URLs

(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]

https://www.cnblogs.com/speeding/p/5097790.html
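The four answers above can be sanity-checked quickly against .NET's built-in regex engine; a minimal F# sketch (the sample inputs are my own):

    open System.Text.RegularExpressions

    // Each entry: description, pattern from the answers above, sample input.
    let cases =
        [ "tabs/spaces", @"^(\t|\s)*$",                   "\t  \t"
          "C comment",   @"/\*[^*]*\*+([^/*][^*]*\*+)*/", "/* a/**/sfasfd */"
          "string",      @"^""[a-zA-Z\s]+""$",            "\"this is a string\""
          "float",       @"^(-?\d+)(\.\d+)?$",            "3.14159" ]

    for (name, pattern, input) in cases do
        printfn "%-12s matches %-22s : %b" name input (Regex.IsMatch(input, pattern))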

Three 、 Explain, with concrete examples (p. 13), the longest-match rule and rule-order priority

The longest-match rule: for the input string 'if8', the lexer could match the reserved word 'if' (two characters) or an identifier (three characters); the longer match is preferred, so 'if8' is an identifier. The same idea appears as greedy matching in regex engines: the expression a.*b applied to ababab matches the longest possible string, here the whole of ababab.

Rule-order priority: for the input string 'if 89', both the reserved-word pattern and the identifier pattern match the prefix 'if' with the same length; the rule listed first (the reserved word) takes priority, so 'if' is reported as a reserved word rather than an identifier.
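A runnable illustration of the greedy longest-match behaviour mentioned above, using .NET regular expressions from F#:

    open System.Text.RegularExpressions

    // Greedy: .* takes as much as possible, so a.*b runs to the last 'b'.
    printfn "%s" (Regex.Match("ababab", "a.*b").Value)   // prints: ababab
    // The lazy quantifier .*? stops at the first possible 'b' instead.
    printfn "%s" (Regex.Match("ababab", "a.*?b").Value)  // prints: ab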

Four 、 Consider the following regular expressions and draw the corresponding NFA/DFA

(ab|ac)*
(0|1)* 1100 1*
(01|10|00)* 11

1、 Constructing the NFA

[figures: hand-drawn NFA constructions omitted]

2、 Constructing the DFA

[figures: DFA constructions omitted]

3、 Minimizing the DFA

[figures: minimized DFAs omitted]
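As a cross-check on the drawings above, here is a minimal table-driven DFA for the first pattern, (ab|ac)*, written in F#; the state numbering is my own and need not match the figures:

    // DFA for (ab|ac)*. State 0 is the start state and the only accepting state;
    // state 1 means "just saw 'a', expecting 'b' or 'c'"; None is the dead state.
    let step state ch =
        match state, ch with
        | 0, 'a'          -> Some 1   // begin an ab/ac group
        | 1, 'b' | 1, 'c' -> Some 0   // close the group, back to accepting
        | _               -> None     // no transition: reject

    let accepts (s: string) =
        Seq.fold (fun st c -> Option.bind (fun q -> step q c) st) (Some 0) s = Some 0

    ["", true; "ab", true; "abacab", true; "a", false; "abb", false]
    |> List.iter (fun (s, expected) ->
        printfn "\"%s\" -> %b (expected %b)" s (accepts s) expected)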

Five 、 Study code/lex.zip and understand how the lex lexical-analyzer generator works

In a real compiler, the lexical analyzer (Lexer) is called by the parser (Parser); for details, see the calcvar code in Section Six below.

Lex is a lexical-analyzer generator. It describes the patterns of lexical units with regular expressions: given such a specification of a lexical analyzer, it generates the corresponding implementation code. The input notation of a Lex program is called the Lex language, and the translator itself is called the Lex compiler; Flex is an alternative implementation of the Lex compiler (the lex/flex relationship is much like the cc/gcc one).

How a Lex program is used is shown in the figure below (omitted): the Lex source file is translated by the Lex compiler into C code, which is then compiled into the lexical analyzer.

A Lex program usually consists of the following parts:

  • Declaration part: usually includes variables, manifest constants, and regular definitions. A manifest constant is an identifier standing for a numeric value, used to represent the type of a lexical unit.
  • Translation rules: each rule has the form pattern { action }. Each pattern is a regular expression and may use the regular definitions from the declaration part. Each action is a code fragment, usually written in C.
  • Auxiliary functions: this part defines the functions needed by the actions and may also include a main function; the code here is copied into the generated C output.
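The same three-part layout carries over to fslex, which the later sections use; here is a minimal standalone sketch (the token names and the lexeme helper are hypothetical, not taken from the lab code):

    {
    (* declaration part: header with F# code, followed by named regular definitions *)
    open FSharp.Text.Lexing
    type Token = NUMBER of int | PLUS | EOF
    let lexeme (lb: LexBuffer<char>) = LexBuffer<char>.LexemeString lb
    }

    let digit = ['0'-'9']    (* a named regular definition *)

    (* translation rules: pattern { action } pairs; the actions are F# code *)
    rule Token = parse
      | digit+       { NUMBER (int (lexeme lexbuf)) }   (* semantic value from the matched text *)
      | '+'          { PLUS }
      | [' ' '\t']   { Token lexbuf }                   (* skip blanks, keep scanning *)
      | eof          { EOF }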

How a Lex-generated program works:

The generated lexer provides one function, int yylex(), and two variables, yyleng and yytext, for external use (typically read and called by the parser).

When yylex is called, the program reads characters one by one from the input stream pointed to by yyin. Once it has found the longest string matching some pattern Pi, it stores that string in the yytext variable and sets the yyleng variable to the string's length; this string is the lexical unit recognized by the analyzer.

The lexer then executes the action Ai corresponding to pattern Pi, and yylex returns Ai's return value (usually the type of the lexical unit).

https://www.bwangel.me/2019/12/15/flex/
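Seen from the caller's side, an fslex-generated lexer works the same way: each call to the rule function returns the next longest-match token. A hypothetical F# driver loop over the MicroC lexer (CLex.Token and CPar.EOF are the names used later in this report; the loop itself is my own sketch):

    open FSharp.Text.Lexing

    // Pull tokens one at a time until EOF, collecting them into a list.
    let tokenize (source: string) =
        let lexbuf = LexBuffer<char>.FromString source
        let rec loop acc =
            match CLex.Token lexbuf with   // one call = one longest-match token
            | CPar.EOF -> List.rev (CPar.EOF :: acc)
            | tok      -> loop (tok :: acc)
        loop []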

Six 、 Study the calcvar language program in the repository https://gitee.com/sigcc/plzoofs (directory calcvar)

1、 Compile and run the program; refer to the instructions at https://gitee.com/sigcc/plzoofs/blob/master/calcvar/README.markdown to run it in debug mode


View the tokens and the syntax tree


2、 Understand how variable values are bound and stored in the environment

Reference: Compiler Principles (19): Runtime storage space management (the variable access environment), CSDN blog
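The interpreters in plzoofs typically represent the environment as an association list from names to values; a simplified F# sketch of binding and lookup (my own reduced version, not the repo's exact code):

    // Environment: newest binding first; lookup finds the nearest (innermost) one.
    type Env = (string * int) list

    let rec lookup (env: Env) x =
        match env with
        | []             -> failwithf "unbound variable: %s" x
        | (y, v) :: rest -> if x = y then v else lookup rest x

    // Binding conses onto the front, so a new binding shadows an older one.
    let bind (env: Env) x v : Env = (x, v) :: env

    let env = bind (bind [] "x" 1) "x" 42
    printfn "x = %d" (lookup env "x")   // prints: x = 42 (the newer binding wins)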

3、 Understand the code of the lexical description file lexer.fsl

Reference: Using FsLexYacc, the F# lexer and parser (thanos.codes)

4、 Get a first understanding of the grammar description file parser.fsy

Use dotnet run -v n to see the commands with which dotnet invokes the lexer/parser generator tools; the output looks like the following (the exact paths differ per machine):

dotnet "C:\Users\gm\.nuget\packages\fslexyacc\10.2.0\build\/fslex/netcoreapp3.1\fslex.dll"  -o "Scanner.fs" --module Scanner --unicode Scanner.fsl
...  
dotnet "C:\Users\gm\.nuget\packages\fslexyacc\10.2.0\build\/fsyacc/netcoreapp3.1\fsyacc.dll"  -o "Parser.fs" --module Parser Parser.fsy
dotnet "C:\Users\shuhe\.nuget\packages\fslexyacc\10.2.0\build\fslex\netcoreapp3.1\fslex.dll"  -o "Scanner.fs" --module Scanner --unicode Scanner.fsl


dotnet "C:\Users\shuhe\.nuget\packages\fslexyacc\10.2.0\build\/fsyacc\netcoreapp3.1\fsyacc.dll"  -o "Parser.fs" --module Parser Parser.fsy


Understand the fsproj project-file parameters; they are what get passed to the commands above:

<FsLex Include="lexer.fsl">
<OtherFlags>--module Lexer --unicode</OtherFlags>
</FsLex>
<FsYacc Include="parser.fsy">
<OtherFlags>--module Parser</OtherFlags>
</FsYacc>

Reference: [Translation] Python Lex-Yacc manual, P_Chou, cnblogs.com

Seven 、 Study the structure of CLex.fsl and understand MicroC's lexical analyzer

Run the MicroC interpreter and compiler in debug mode (-g); see README.md for details

  dotnet build -v n interpc.fsproj 


dotnet "C:\Users\shuhe\.nuget\packages\fslexyacc\10.2.0\build\fslex\netcoreapp3.1\fslex.dll"   -o "CLex.fs" --module CLex --unicode CLex.fsl


  dotnet "C:\Users\shuhe\.nuget\packages\fslexyacc\10.2.0\build\/fsyacc\netcoreapp3.1\fsyacc.dll"  -o "CPar.fs" --module CPar CPar.fsy


Understand what a token means; pay attention to token types and the corresponding semantic values <Type, Value>

  • Many books translate "token" as "mark"; we think a more fitting translation is "word symbol". A token is a programming language's smallest lexical unit with independent meaning, and in that sense corresponds to a "word" in natural language. The match is not exact, though, because tokens include not only words but also punctuation, operators, delimiters, and so on; the translation "word symbol" is meant to reflect exactly this point. For brevity we simply say "word". (translator's note)

  • VOID: the semantic value coincides with the type

  • NAME "n": the type is NAME and the semantic value is "n"

  • CSTINT 1: the type is CSTINT and the semantic value is 1
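In F#, these <Type, Value> pairs map naturally onto discriminated-union cases, which is how the tokens in the trace below are represented; a reduced sketch (the real MicroC token type has many more cases):

    // A token's "type" is the union case; its semantic value is the payload.
    type Token =
        | VOID               // keyword: the case alone says everything, no payload
        | NAME   of string   // identifier: the payload is the name, e.g. NAME "n"
        | CSTINT of int      // integer constant: the payload is the value, e.g. CSTINT 1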

~ microc>dotnet run -p .\interpc.fsproj -g ex1.c 8
Micro-C interpreter v 1.1.0 of 2021-3-22
interpreting ex1.c ...inputargs:[8]
Token:
VOID, NAME "main", LPAR, INT, NAME "n", RPAR, LBRACE, WHILE, LPAR, NAME "n", GT, CSTINT 0, RPAR, LBRACE, PRINT, NAME "n", SEMI, NAME "n", ASSIGN, NAME "n", MINUS, CSTINT 1, SEMI, RBRACE, PRINTLN, SEMI, RBRACE, EOF


Understand keyword handling

  • An identifier name is passed to the keyword function, which decides at the end whether it is a reserved word or an ordinary NAME
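A reduced sketch of that keyword function in the style of CLex.fsl (the real table covers every MicroC reserved word; the token constructors come from the parser):

    // Identifiers and reserved words share one lexer pattern; keyword decides
    // which token to emit, falling back to NAME for ordinary identifiers.
    let keyword s =
        match s with
        | "if"    -> IF
        | "while" -> WHILE
        | "void"  -> VOID
        | _       -> NAME s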

Understand comment handling

  • // single-line

    • The single-line comment rule calls the corresponding handler, with lexbuf as the argument

    • After the handler returns, lexbuf has been advanced past the comment (the comment is filtered out), and the Token rule is called again to continue scanning after it:

      | "//" { EndLineComment lexbuf; Token lexbuf } // single-line comment: call the EndLineComment rule
      | "/*" { Comment lexbuf; Token lexbuf }        // multi-line comment: call the Comment rule

  • /* */ multi-line

    • A multi-line comment calls the Comment rule:

      and Comment = parse
        | "/*"           { Comment lexbuf; Comment lexbuf }  // nested comments are handled recursively
        | "*/"           { () }                              // end of the comment
        | '\n'           { lexbuf.EndPos <- lexbuf.EndPos.NextLine; Comment lexbuf }  // comments may span lines
        | (eof | '\026') { failwith "Lexer error: unterminated comment" }             // comment never closed
        | _              { Comment lexbuf }                  // any other character: keep scanning

      and EndLineComment = parse
        | '\n'           { lexbuf.EndPos <- lexbuf.EndPos.NextLine }  // update the end-of-line position, then return
        | (eof | '\026') { () }  // end of file; 26 is the ASCII code of CTRL+Z, also a terminator: return ()
        | _              { EndLineComment lexbuf }  // keep reading the next character from lexbuf

Understand the handling of blanks and line breaks

  • | '\n'            { lexbuf.EndPos <- lexbuf.EndPos.NextLine; Token lexbuf } // newline
                                                                                // EndPos is an instance of the built-in Position type,
                                                                                // recording the end position of the current line

Understand the handling of strings (MicroC performs only the lexical analysis; strings are not fully implemented)

  • | '"' { CSTSTRING (String [] lexbuf) } // call the string-processing rule

  •   and String chars = parse 
        | '"'
            { Microsoft.FSharp.Core.String.concat "" (List.map string (List.rev chars)) } 
                                                              // end of string: build the result from the character list chars;
                                                              // the list was built with the cons (::) operation, which prepends,
                                                              // so List.rev is needed to restore the original character order
        | '\\' ['\\' '"' 'a' 'b' 't' 'n' 'v' 'f' 'r']         // the source string "\a" reaches the lexer as "\\a"
            { String (cEscape (lexemeAsString lexbuf) :: chars) lexbuf }
        | "''" 
            { String ('\'' :: chars) lexbuf }
        | '\\'
            { failwith "Lexer error: illegal escape sequence" }
        | (eof | '\026')       { failwith "Lexer error: unterminated string" } 
                                                              // end of file inside a string
        | ['\n' '\r']  { failwith "Lexer error: newline in string" }  // line break inside a string
        | ['\000'-'\031' '\127' '\255']    
            { failwith "Lexer error: invalid character in string" }   // an ASCII control character inside the string
        | _
            { String (char (lexbuf.LexemeChar 0) :: chars) lexbuf }   // append the character just read to the temporary chars list

Understand the handling of escape characters (MicroC performs only the lexical analysis; this is not fully implemented)

  •   //  String escape handler 
      let cEscape s = 
          match s with 
          | "\\\\" -> '\\'
          | "\\\"" -> '\"'
          | "\\a"  -> '\007'
          | "\\b"  -> '\008'
          | "\\t"  -> '\t'
          | "\\n"  -> '\n'
          | "\\v"  -> '\011'
          | "\\f"  -> '\012'
          | "\\r"  -> '\r'
          | _      -> failwith "Lexer error: impossible C escape"
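A usage note: by the time the escape rule above fires, the two source characters \ and n reach the action as the two-character lexeme "\\n" (as F# writes it), and cEscape folds them into one character:

    // The lexeme for the escape \n is the two-character string "\\n";
    // cEscape maps it to the single newline character.
    printfn "%A" (cEscape "\\n")   // prints: '\010' (the newline character)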
    

Eight 、 Form course-project groups and settle on a language: Decaf/MicroC/Tiger, ...; start implementing the lexical analyzer (the language is your own choice)

  • Course project, lexical part

  • Read the language specification and implement its lexical analyzer with fslex

    • Textbook appendix, p. 360: the Tiger language specification
  • Pay attention to the nested-comment implementation /* /* */ */

  • Pay attention to the string implementation "" '' " ' ' "

Nine 、 Reading

  • Understanding Computation, Chapter 3, the automata part: learn the table-driven technique for implementing NFAs and DFAs by hand
    https://bb.zucc.edu.cn/bbcswebdav/users/j04014/books/Understanding.Computation

    • For the Ruby used in the book, see Chapter 1

    • Chapter 2 will be used later; preview it in advance

Ten 、 (Optional) Read JFLAP-simple tutorial.pdf and use JFLAP to construct the NFA/DFA from Problem Four; take screenshots and keep the files. JFLAP is in the soft directory on BB; run it with

java -jar JFLAP.jar

Copyright notice: this article was created by [The stars don't want to roll]; when reposting, please include a link to the original, thank you
https://yzsam.com/2022/175/202206240605218785.html