当前位置:网站首页>JSON Library Tutorial from scratch (III): parsing strings, learning and sorting notes

JSON Library Tutorial from scratch (III): parsing strings, learning and sorting notes

2022-06-25 05:17:00 zhongh58

Original website : From scratch JSON Library Tutorial - You know

1. JSON String Syntax

JSON String Syntax and C The language is very similar , The characters are enclosed in double quotation marks , Such as "Hello". But the string is separated by double quotation marks , So how can I insert a double quotation mark into a string ? hold a"b It's written in "a"b" Definitely not , I don't know where the string ends .

therefore , We need to introduce escape characters (escape character),C Language and JSON All use \( Backslash ) As escape character , that " In the string, it is expressed as \",a"b Of JSON The string is written as "a\"b". As shown in the following String Syntax ,JSON In support of 9 Escape sequence :

string = quotation-mark *char quotation-mark
char = unescaped /
   escape (
       %x22 /          ; "    quotation mark  U+0022
       %x5C /          ; \    reverse solidus U+005C
       %x2F /          ; /    solidus         U+002F
       %x62 /          ; b    backspace       U+0008
       %x66 /          ; f    form feed       U+000C
       %x6E /          ; n    line feed       U+000A
       %x72 /          ; r    carriage return U+000D
       %x74 /          ; t    tab             U+0009
       %x75 4HEXDIG )  ; uXXXX                U+XXXX
escape = %x5C          ; \
quotation-mark = %x22  ; "
unescaped = %x20-21 / %x23-5B / %x5D-10FFFF

A simple translation ,JSON A string is composed of two double quotation marks with zero or more characters . Character separation no escape character or escape sequence . The escape sequence has 9 Kind of , It all starts with a backslash , As is common \n Represents a newline character . What's more special is \uXXXX, among XXXX by 16 Carry on UTF-16 code , This unit will not handle this escape sequence , Let's break it down later .

Non escape characters are ordinary characters , The syntax lists the legal code point range ( The code point is introduced in the next unit ). It should be noted that , This scope does not include 0 to 31、 Double quotes and backslashes , These code points must be expressed by escape .

2. String representation

stay C In language , Strings are generally represented as empty terminated strings (null-terminated string), Empty character ('\0') Represents the end of the string . However ,JSON Strings are allowed to contain empty characters , For example, this JSON "Hello\u0000World" Is a single string , After parsing, it is 11 Characters . If you simply use a null terminated character to indicate JSON Results after analysis , Can't handle empty characters .

  therefore , We can allocate memory to store the parsed characters , And the number of recording characters ( The length of the string ). Because most C Programs assume that a string is an empty terminated string , Let's add an empty character at the end , Then there is no need to deal with \u0000 The application of this character can simply regard it as a null terminated string .

lept_value In fact, it is a variant type (variant type), We go through type To determine what type it is now , This also determines which members are valid . First, we simply add two members to the structure :

typedef struct {
    char* s;
    size_t len;
    double n;
    lept_type type;
}lept_value;

We know , A value cannot be both a number and a string , So we can use C Linguistic union To save memory :

typedef struct {
    union {
        struct { char* s; size_t len; }s;  /* string */
        double n;                          /* number */
    }u;
    lept_type type;
}lept_value;

These two designs are in 32 The memory layout of bit platform is as follows , It can be seen that... Is used on the right union Can save memory .

  We're going to take the previous v->n Change to v->u.n. To access the string data , Use v->u.s.s and v->u.s.len. This kind of writing is troublesome , Actually C11 New anonymous struct/union grammar , You can use v->n、v->s、v->len To make an interview .

3. memory management

Because the length of the string is not fixed , We need to allocate memory dynamically . For the sake of simplicity , We use the standard library <stdlib.h> Medium malloc()、realloc() and free() To allocate / Free memory .

When setting a value as a string , We need to make a copy of the string in the parameter :

void lept_set_string(lept_value* v, const char* s, size_t len) {
    assert(v != NULL && (s != NULL || len == 0));
    lept_free(v);
    v->u.s.s = (char*)malloc(len + 1);
    memcpy(v->u.s.s, s, len);
    v->u.s.s[len] = '\0';
    v->u.s.len = len;
    v->type = LEPT_STRING;
}

Setting this v Before , We need to call lept_free(v) Go empty v Memory that may be allocated . For example, there is already a string , We have to release it first . Then simply use malloc() Distribution and use memcpy() Copy , And fill in the blank character at the end .malloc(len + 1) Medium 1 Because the end is empty .

void lept_free(lept_value* v) {
    assert(v != NULL);
    if (v->type == LEPT_STRING)
        free(v->u.s.s);
    v->type = LEPT_NULL;
}

  Because we will check v The type of , Before calling all access functions , We must initialize this type . So we joined in lept_init(v), Because it is very simple, we use macros to realize :

#define lept_init(v) do { (v)->type = LEPT_NULL; } while(0)

4. Buffer and stack

We parse the string ( And then the array 、 object ) when , You need to store the parsing results in a temporary buffer , Last but not least lept_set_string() Set the result of the buffer into the value . Before you finish parsing a string , The size of this buffer is unpredictable . therefore , We can use dynamic arrays (dynamic array) This data structure , That is, when the array space is insufficient , Can automatically expand .C++ Standard library std::vector It is also a dynamic array .

If every time you parse a string , Re create a dynamic array , It is time-consuming . We can reuse this dynamic array , Every time you parse JSON You just need to create one . And we will find , Whether it's parsing strings 、 Array or object , We only need to access this dynamic array in a first in, last out way . let me put it another way , We need a dynamic stack data structure .

We put data from a dynamic stack into lept_context in :

typedef struct {
    const char* json;
    char* stack;
    size_t size, top;
}lept_context;

among size Is the current stack capacity ,top Is the top of the stack ( Because we will expand stack, So don't put top Store in pointer form ).

We're creating lept_context Initialization when stack And finally free memory :

int lept_parse(lept_value* v, const char* json) {
    lept_context c;
    int ret;
    assert(v != NULL);
    c.json = json;
    c.stack = NULL;        /* <- */
    c.size = c.top = 0;    /* <- */
    lept_init(v);
    lept_parse_whitespace(&c);
    if ((ret = lept_parse_value(&c, v)) == LEPT_PARSE_OK) {
        /* ... */
    }
    assert(c.top == 0);    /* <- */
    free(c.stack);         /* <- */
    return ret;
}

On release , Add assertions to ensure that all data is ejected .

then , We implement stack push and pop operations . It is different from the normal stack , Our stack is stored in bytes . Data of any size can be pressed every time , It returns a pointer to the beginning of the data ( Meeting C++ You can refer to [1]):

#ifndef LEPT_PARSE_STACK_INIT_SIZE
#define LEPT_PARSE_STACK_INIT_SIZE 256
#endif

static void* lept_context_push(lept_context* c, size_t size) {
    void* ret;
    assert(size > 0);
    if (c->top + size >= c->size) {
        if (c->size == 0)
            c->size = LEPT_PARSE_STACK_INIT_SIZE;
        while (c->top + size >= c->size)
            c->size += c->size >> 1;  /* c->size * 1.5 */
        c->stack = (char*)realloc(c->stack, c->size);
    }
    ret = c->stack + c->top;
    c->top += size;
    return ret;
}

static void* lept_context_pop(lept_context* c, size_t size) {
    assert(c->top >= size);
    return c->stack + (c->top -= size);
}

 

5. Parse string

With the above tools , The task of parsing strings becomes very simple . We just need to back up the top of the stack first , Then put the parsed characters on the stack , Finally, calculate the length and pop up all the characters at once , Then set it to the value . The following is a partial implementation , The escape and some illegal character verification are not handled .

#define PUTC(c, ch) do { *(char*)lept_context_push(c, sizeof(char)) = (ch); } while(0)

static int lept_parse_string(lept_context* c, lept_value* v) {
    size_t head = c->top, len;
    const char* p;
    EXPECT(c, '\"');
    p = c->json;
    for (;;) {
        char ch = *p++;
        switch (ch) {
            case '\"':
                len = c->top - head;
                lept_set_string(v, (const char*)lept_context_pop(c, len), len);
                c->json = p;
                return LEPT_PARSE_OK;
            case '\0':
                c->top = head;
                return LEPT_PARSE_MISS_QUOTATION_MARK;
            default:
                PUTC(c, ch);
        }
    }
}

about *p++, to *p, After the operation, proceed to ++ operation .

原网站

版权声明
本文为[zhongh58]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/02/202202210518428272.html