BERT series (RoBERTa, ALBERT, ERNIE): detailed explanation and usage notes
2022-06-24 05:15:00 【Goose】
0. Background
Following the previous post, this article covers BERT and its derived models, such as RoBERTa, ALBERT, and ERNIE.
But first, let's take a closer look at BERT itself.
1. BERT
The full name of BERT is Bidirectional Encoder Representations from Transformers. As the paper title and the full name suggest, what BERT does is encode contextual information bidirectionally. The paper's main points of comparison are ELMo and GPT, whose biggest problem is that they are 「not truly bidirectional encoders」.
In contrast to OpenAI GPT (Generative Pre-trained Transformer), BERT stacks bidirectional Transformer blocks; much like the difference between a unidirectional RNN and a bidirectional RNN, intuitively this should work better.
ELMo uses P(w_i|w_1..w_{i-1}) and P(w_i|w_{i+1}..w_n) as objective functions, trains two independent representations and then concatenates them, whereas BERT trains the LM with P(w_i|w_1..w_{i-1},w_{i+1}..w_n) as the objective.
- GPT uses the Transformer decoder, so it is inherently unidirectional.
- ELMo concatenates the forward and backward LSTM vectors, but this is not truly bidirectional (think about it: the two directions are concatenated but never interact). Such a structure is a disadvantage for downstream tasks; for example, in question answering it is crucial to encode the context from both directions at once.
BERT uses the Transformer encoder, so encoding each token takes the interaction with all input tokens into account, 「which is why BERT is a truly bidirectional encoding model」.
Here is a brief description of the feature-based and fine-tuning paradigms for using pre-trained representations:
- Feature-based paradigm (feature extraction)
Representative work: ELMo. Before 2017, when the Transformer had not yet appeared, the most common way to tackle an NLP task was to 「use word vectors trained by someone else as the embedding」 and then stack freshly initialized RNN/LSTM/CNN layers on top; pre-training only provides feature-based embeddings.
In the feature-based setting, the pre-trained model weights are not updated.
- Fine-tuning paradigm
Representative work: GPT. For downstream tasks, not only the input embeddings but also the Transformer parameters (attention layers, fully connected layers, etc.) are kept; during fine-tuning you only need to add a few simple layers on top of the original Transformer to adapt it to a specific downstream task.
BERT naturally belongs to the fine-tuning paradigm.
In the fine-tuning setting, the pre-trained model weights are updated.
1.1 Model architecture
BERT provides a unified architecture for solving a variety of downstream tasks. When we want to fine-tune for a specific task, we only need to add a few network layers on top of the original structure. 「This keeps the network used for pre-training and the network used for a specific downstream task almost identical, which helps preserve as much as possible of what BERT learned during pre-training」.
The model can be roughly divided into three parts: the input layer, the middle layers, and the output layer. All of them are consistent with the Transformer encoder, except that the input layer is slightly modified.
Input layer
To adapt BERT to downstream tasks (e.g. classification, or sentence-pair tasks such as QA), the input is changed to [CLS] + segment A + [SEP] (+ segment B + [SEP])
where:
- [CLS]: a special token for classification tasks; its output is the model's pooler output
- [SEP]: the separator token
- Segment A and segment B are the input text; segment B can be empty, in which case the input becomes [CLS] + segment A + [SEP]
Because the Transformer by itself cannot capture word order, BERT, like the Transformer, adds absolute position encodings. The difference is that BERT does not use the Transformer's functional (sinusoidal) encoding; instead it uses a parametric approach similar to word embeddings and learns the position embeddings directly.
Since the input was modified so that the model may receive multiple segments, a segment embedding is also added. For example, for [CLS], A_1, A_2, A_3, [SEP], B_1, B_2, B_3, [SEP] the corresponding segment ids are [0,0,0,0,0,1,1,1,1]; an embedding_lookup on the segment ids then gives the segment embedding. A code snippet follows.
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)There are three input layers embedding Add up (position embedding + segment embedding + token embedding) Why is that ?
First, take a single token and assume the vocabulary size (vocabulary_size) = 5 with token_id = 2, the token sits at position 0 with a maximum position length max_position_size = 6, and there are two possible segments with this token belonging to segment = 0.
Let's look at the lookup for the three kinds of embeddings. In the code below, the three embedding matrices token_embedding, position_embedding and segment_embedding are fixed constants. Recall that an embedding lookup is just one-hot encoding the id and multiplying it with the embedding matrix; in the example, embd_embd_onehot_impl and embd_token give identical results.
After obtaining the three embeddings (embd_token, embd_position, embd_segment), we add them together to get embd_sum.
The result is identical to concatenating the three one-hot vectors and doing a single lookup: below, we concatenate token_id_onehot, position_id_onehot and segment_id_onehot into concat_id_onehot, concatenate the three embedding matrices into concat_embedding, and multiply them to obtain embd_cat.
You will find that embd_sum == embd_cat. See the code below for details.
```python
import tensorflow as tf  # TF1-style code; under TF2 use tf.compat.v1 and disable eager execution

token_id = 2
vocabulary_size = 5
position = 0
max_position_size = 6
segment_id = 0
segment_size = 2
embedding_size = 4

# fixed embedding matrices
token_embedding = tf.constant([[-3., -2, -1, 0], [1, 2, 3, 4], [5, 6, 7, 8],
                               [9, 10, 11, 12], [13, 14, 15, 16]])            # (vocabulary_size, embedding_size)
position_embedding = tf.constant([[17., 18, 19, 20], [21, 22, 23, 24], [25, 26, 27, 28],
                                  [29, 30, 31, 32], [33, 34, 35, 36], [37, 38, 39, 40]])  # (max_position_size, embedding_size)
segment_embedding = tf.constant([[41., 42, 43, 44], [45, 46, 47, 48]])        # (segment_size, embedding_size)

token_id_onehot = tf.one_hot(token_id, vocabulary_size)
position_id_onehot = tf.one_hot(position, max_position_size)
segment_id_onehot = tf.one_hot(segment_id, segment_size)

# embedding lookup == one-hot followed by matrix multiplication
embd_embd_onehot_impl = tf.matmul([token_id_onehot], token_embedding)
embd_token = tf.nn.embedding_lookup(token_embedding, token_id)
embd_position = tf.nn.embedding_lookup(position_embedding, position)
embd_segment = tf.nn.embedding_lookup(segment_embedding, segment_id)
embd_sum = tf.reduce_sum([embd_token, embd_position, embd_segment], axis=0)

# concatenate the one-hots and the embedding matrices, then do a single lookup
concat_id_onehot = tf.concat([token_id_onehot, position_id_onehot, segment_id_onehot], axis=0)
concat_embedding = tf.concat([token_embedding, position_embedding, segment_embedding], axis=0)
embd_cat = tf.matmul([concat_id_onehot], concat_embedding)

with tf.Session() as sess:
    print(sess.run(embd_embd_onehot_impl))  # [[5. 6. 7. 8.]]
    print(sess.run(embd_token))             # [5. 6. 7. 8.]
    print(sess.run(embd_position))          # [17. 18. 19. 20.]
    print(sess.run(embd_segment))           # [41. 42. 43. 44.]
    print(sess.run(embd_sum))               # [63. 66. 69. 72.]
    print(sess.run(concat_embedding))
    '''
    [[-3. -2. -1.  0.]
     [ 1.  2.  3.  4.]
     [ 5.  6.  7.  8.]
     [ 9. 10. 11. 12.]
     [13. 14. 15. 16.]
     [17. 18. 19. 20.]
     [21. 22. 23. 24.]
     [25. 26. 27. 28.]
     [29. 30. 31. 32.]
     [33. 34. 35. 36.]
     [37. 38. 39. 40.]
     [41. 42. 43. 44.]
     [45. 46. 47. 48.]]
    '''
    print(sess.run(concat_id_onehot))       # [0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0.]
    print(sess.run(embd_cat))               # [[63. 66. 69. 72.]]
```
Middle layer
The middle layers of the model are the same as the Transformer encoder: stacks of self-attention layers, Add & LayerNorm layers, and FFN layers.
Output layer
Each input position of the model has a corresponding output, and we can pick different outputs depending on the task. There are two main kinds of output (a short usage sketch follows the list):
- pooler output: the output corresponding to the [CLS] position (the [CLS] hidden state passed through an extra dense layer with tanh).
- sequence output: the final-layer outputs of all the other input tokens.
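As a concrete illustration, here is a minimal sketch of accessing the two kinds of output, assuming the Hugging Face transformers library and the bert-base-chinese checkpoint are available (neither is part of the original post):

```python
import torch
from transformers import BertTokenizer, BertModel  # assumed dependency

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

# a sentence pair is automatically formatted as [CLS] A [SEP] B [SEP]
inputs = tokenizer("今天天气不错", "适合出去散步", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

sequence_output = outputs.last_hidden_state  # (batch, seq_len, hidden): one vector per token
pooler_output = outputs.pooler_output        # (batch, hidden): [CLS] vector after dense + tanh
print(sequence_output.shape, pooler_output.shape)
```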
1.2 Model input
「WordPiece」
At input time the model does not operate on whole words but on WordPieces. Concretely, if you look at the vocab of the original BERT, some English entries are prefixed with ##, e.g. ##bed. A word such as embedding may be split by WordPiece into em, ##bed, ##d, ##ing; a ##-prefixed piece means it is the continuation of a word rather than a standalone word (so bed and ##bed in the vocabulary have completely different meanings). You can look up BPE for more background. Using WordPieces as input effectively alleviates the OOV problem.
As for Chinese, in my view single characters are still used as the input unit, because Chinese words cannot easily be split up the way English words can.
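A quick illustration of WordPiece tokenization, again a sketch assuming the Hugging Face transformers library; the exact pieces depend on the vocabulary of the checkpoint you load:

```python
from transformers import BertTokenizer  # assumed dependency

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("embedding"))        # sub-word pieces, e.g. ['em', '##bed', '##ding']
print(tokenizer.tokenize("unbelievability"))  # rare words get split into pieces instead of becoming [UNK]
```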
「Segment Pairs Input 」
BERT takes sentence pairs as input. Why introduce sentence pairs? So that BERT can handle more kinds of downstream tasks (sentence-similarity tasks, question answering, and other multi-sentence inputs). Note that the "sentence" here is 「used in a loose sense: it does not mean a single linguistic sentence but a contiguous span of text, which may contain one or more sentences」, so the actual input may contain more than two sentences.
So strictly speaking the paper should not have said sentence pairs but segment pairs. The later RoBERTa experiments verified that building a pair out of spliced single sentences, rather than out of contiguous segments, actually hurts performance.
1.3 Pre-training tasks
1.3.1 Masked LM(MLM)
This task is like the cloze tests we did in school. After the sentence pairs are WordPiece-tokenized, 15% of the tokens in the corpus are masked with the [MASK] symbol, and the model has to predict the masked words. For example, the original sentence my dog is hairy becomes my dog is [MASK] after masking. The final-layer hidden vector at each [MASK] position is fed into a softmax layer over the vocabulary to classify which word it was.
Q: For the 15% of selected WordPiece tokens, why not replace all of them with [MASK]? Why replace 10% with a random token and keep 10% unchanged?
[MASK] explicitly tells the model 『I won't show you this word, guess it from the context』, which prevents information leakage. If everything selected were [MASK] and everything else the original token, the model would learn 『if the current word is [MASK], infer it from the other words; if it is a normal word, just copy the input』. Then at fine-tuning time, where every word is a normal word, the model would simply copy everything and never extract dependencies between words.
Replacing with a random token with some probability keeps the model on its toes: at every position it has to combine the current token with the information inferred from the context. As a result, on normal sentences at fine-tuning time the model also extracts both kinds of information, because it never knows whether the 『normal word』 it sees has been tampered with.
Q: How are the [MASK] tokens actually used for prediction?
The final loss is computed only over the masked tokens, and the number of [MASK] tokens per sentence varies. In the actual implementation each sequence has a fixed maximum number of predictions (max_predictions_per_seq): the vectors at all [MASK] positions plus some padding positions are taken out and predicted (always exactly max_predictions_per_seq predictions, i.e. a fixed length), and then a weight mask zeroes out the padding so that only the [MASK] positions contribute to the loss.
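A minimal sketch of that loss masking (PyTorch, purely illustrative; logits are assumed to already be gathered at the max_predictions_per_seq selected positions, and the names are mine rather than from the official implementation):

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, label_ids, label_weights):
    """logits: (batch, max_predictions_per_seq, vocab_size), gathered at the selected positions.
    label_ids: (batch, max_predictions_per_seq) true token ids, 0 at padding slots.
    label_weights: (batch, max_predictions_per_seq), 1.0 at real [MASK] slots, 0.0 at padding."""
    vocab_size = logits.size(-1)
    per_position = F.cross_entropy(
        logits.reshape(-1, vocab_size), label_ids.reshape(-1), reduction="none")
    weights = label_weights.reshape(-1)
    # padding slots are multiplied by 0, so only real [MASK] positions contribute to the loss
    return (per_position * weights).sum() / (weights.sum() + 1e-5)
```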
But this raises a problem: 「the inputs at pre-training time and at downstream time are inconsistent, because downstream inputs basically never contain [MASK], and this mismatch hurts BERT's performance」. This became one of the improvement directions of later work. BERT itself already mitigates it a little: not all of the selected 15% of tokens are replaced with [MASK]; 80% of them are replaced with [MASK], 10% with a random word, and 10% are kept unchanged.
1.3.2 Next Sentence Prediction(NSP)
The task: judge whether a sentence pair really consists of consecutive sentences.
Sentences in the corpus are ordered and adjacent. [SEP] is used as the sentence separator and [CLS] as the classification symbol; some sentence pairs in the corpus are then shuffled and re-spliced to form training samples like the following:
```
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext
Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext
```
The three embeddings at the bottom of the figure above mean the following:
- Token Embedding: ordinary word vectors, i.e. nn.Embedding() in PyTorch.
- Segment Embedding: lets the model tell the two sentences apart through the embedding: every token of the first sentence gets 0 and every token of the second sentence gets 1, so the model can locate where each sentence starts and ends. For example:
```
[CLS] My dog is very cute [SEP] Penguins are not good at flying [SEP]
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
```
- Position Embedding: unlike the Transformer's, it is not a fixed trigonometric function but learned.
The final-layer hidden vector at the [CLS] position is fed into a softmax binary classifier that predicts whether the two sentences are adjacent.
As you can see, both pre-training tasks learn embeddings for the input symbols during training, and each task only needs one extra output layer on top of the final-layer embeddings. The process also introduces special symbols such as [CLS], whose learned embeddings are what specific tasks rely on.
1.3.3 Ablation Experiment
Comparing against removing the NSP task and changing the original MLM task to an LTR (left-to-right) task, the experimental results below show that both the original MLM and NSP tasks are indispensable.
The bigger the model, the better the performance.
「Using BERT in the feature-based paradigm rather than the fine-tuning paradigm」: take the vectors from some of BERT's layers as token embeddings (these embeddings are not updated in the downstream fine-tuning task) and stack the task-specific layers on top, analogous to using word2vec vectors as token features, except that the word2vec vectors are replaced by the outputs of some BERT layers. If you directly use the BERT input embeddings as features, every token's feature is fixed (much like pre-trained word2vec vectors); if you take later layers, each occurrence of a token gets a different feature (somewhat like ELMo). Experiments show BERT is very effective in both the feature-based and the fine-tuning setting.
1.4 Fine-tuning
First, build on top of the pre-trained model appropriately; usually an extra output layer is all that is needed. The BERT paper gives model-construction examples for several common tasks, as follows:
Figures (a) and (b) are sequence-level tasks, (c) and (d) are token-level tasks. (a) is sentence-pair classification, (b) is single-sentence classification; the construction is very simple: feed the hidden output at the [CLS] position (the red arrow in the figure) into a softmax output layer. (c) is reading comprehension and (d) is named entity recognition (NER); the construction is similar: feed the hidden outputs at the token positions indicated by the arrows into a classification output layer.
With designs like the above, the pre-trained model can be fine-tuned for all kinds of tasks, but it does not always apply: some NLP tasks are not well represented by a Transformer encoder architecture and need a task-specific model architecture instead. That is where the feature-based approach is useful.
1.4.1 Classification
Prepend the classification symbol [CLS] to the input sentence, take the output at that position, feed it to a linear classifier, and let it predict a class. The linear classifier's parameters are learned from scratch, while BERT's parameters only need fine-tuning.
Why use the output at the first position as the basis for classification?
Because BERT is a Transformer inside, and the Transformer uses self-attention, the output at [CLS] necessarily contains information about the whole sentence. However, in self-attention each position's output gives most of its weight to its own value; if you used, say, w1's output for classification, that output would attend mostly to w1, and since w1 is a word with concrete meaning this would inevitably bias the result. [CLS] has no meaning of its own, it is just a placeholder, so it does not matter that its output mostly attends to its own value.
Of course, you can also concatenate the outputs of all tokens and use that as the final output (a small sketch of the simple [CLS] head follows).
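A minimal sketch of such a classification head (PyTorch, illustrative; the module and variable names are mine, and encoder stands for any module returning per-token hidden states):

```python
import torch
import torch.nn as nn

class BertClassifier(nn.Module):
    """Linear classifier on top of the [CLS] representation of an encoder."""
    def __init__(self, encoder, hidden_size=768, num_labels=2, dropout=0.1):
        super().__init__()
        self.encoder = encoder                                  # any module returning (batch, seq_len, hidden)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)    # trained from scratch

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask)        # (batch, seq_len, hidden)
        cls_vec = hidden[:, 0]                                  # position 0 is [CLS]
        return self.classifier(self.dropout(cls_vec))           # (batch, num_labels)
```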
1.4.2 Slot filling
Feed the output at each word's position into a linear classifier and predict that word's label. This is essentially still a classification problem, just one prediction per token (a small sketch follows).
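A sketch of the per-token classifier, under the same illustrative assumptions as above:

```python
import torch.nn as nn

class TokenTagger(nn.Module):
    """Predict a label for every token (slot filling / NER style)."""
    def __init__(self, encoder, hidden_size=768, num_tags=10):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden_size, num_tags)   # shared across all positions

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask)      # (batch, seq_len, hidden)
        return self.classifier(hidden)                        # (batch, seq_len, num_tags)
```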
1.4.3 Natural language inference (NLI)
Given a premise and a hypothesis, the model judges whether the hypothesis is entailed, contradicted, or undetermined. This is essentially a three-way classification and works almost the same as the classification case above: predict from the output at [CLS].
1.4.4 Question answering (QA)
Feed an article and a question into the model (the examples here are simple: the answer is guaranteed to appear in the article). The model outputs two integers s and e, meaning the answer spans the s-th to the e-th word of the article.
First separate the question and the article with [SEP] and feed them into BERT, producing the yellow outputs in the figure above. We then train two extra vectors, the orange one and the blue one in the figure. Take the dot product of the orange vector with every yellow vector and apply a softmax; the position with the largest value gives the start. In the figure d2 has the highest probability, so s = 2.
Similarly, take the dot product of the blue vector with every yellow vector; the highest probability falls on d3, so e = 3. The final answer is s = 2, e = 3.
If the final output has s > e, the question is treated as unanswerable.
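A sketch of the start/end computation (PyTorch, illustrative; start_vec and end_vec stand for the two learned vectors described above):

```python
import torch

def qa_span(sequence_output, start_vec, end_vec):
    """sequence_output: (batch, seq_len, hidden) BERT outputs at the article positions.
    start_vec / end_vec: (hidden,) the two learned vectors (orange and blue in the figure)."""
    start_logits = sequence_output @ start_vec   # (batch, seq_len): dot product at every position
    end_logits = sequence_output @ end_vec       # (batch, seq_len)
    s = start_logits.softmax(dim=-1).argmax(dim=-1)   # predicted start index
    e = end_logits.softmax(dim=-1).argmax(dim=-1)     # predicted end index
    return start_logits, end_logits, s, e
```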
1.5 Results: the 9 GLUE tasks
GLUE (General Language Understanding Evaluation) comprises many natural language understanding tasks.
1. MNLI
Multi-Genre Natural Language Inference, a large-scale crowdsourced textual entailment task.
Given 2 sentences, judge the relation of the second to the first: entailment, contradiction, or neutral.
2. QQP
Given 2 questions, judge whether they have the same meaning.
3. QNLI
Question Natural Language Inference, a binary classification task derived from SQuAD.
Given a (question, sentence) pair, judge whether the sentence contains the correct answer.
4. SST-2
Stanford Sentiment Treebank, a binary classification task built from movie reviews.
Given 1 review sentence, judge its sentiment.
5. CoLA
The Corpus of Linguistic Acceptability, a binary classification task.
Given 1 English sentence, judge whether it is grammatical.
6. STS-B
The Semantic Textual Similarity Benchmark, built from news headlines and other sources.
Given 2 sentences, score their similarity from 0 to 5.
7. MRPC
Microsoft Research Paraphrase Corpus, a binary classification task built from online news (2005, about 3600 training examples).
Given 1 sentence pair, judge whether the 2 sentences are semantically equivalent.
8. RTE
Recognizing Textual Entailment, a binary classification task similar to MNLI but with only entailment / non-entailment, and less training data.
9. WNLI
Winograd NLI, a small NLI dataset. The official evaluation is said to be problematic, so the evaluation below excludes it.
GLUE Evaluation results
For sentence-level classification tasks, only the output vector at the [CLS] position is used for classification.
SQuAD-v1.1
SQuAD is a token-level task: instead of the [CLS] position, the vectors at all article positions are used to compute the start and end positions.
Fine-tuned for 3 epochs with a learning rate of 5e-5 and a batch size of 32, achieving the best results at the time.
NER
SWAG
The Situations With Adversarial Generations dataset is a commonsense-reasoning, four-way multiple-choice problem: given a context, choose the continuation that is most likely to happen.
Fine-tuned for 3 epochs with a learning rate of 2e-5 and a batch size of 16.
1.6 BERT's shortcomings and directions for later improvements
- The model is too big and training is too slow. On size: compress BERT; DistilBERT, BERT-PKD and Huawei's TinyBERT all propose knowledge-distillation-based model compression with very good results. ALBERT instead changes the model structure, using parameter sharing and embedding factorization to cut the parameter count while improving results. On training speed:
- Use the LAMB optimizer so the model can train with a much larger batch size.
- Use NVIDIA's mixed precision training, which greatly reduces the memory taken by parameters without hurting model quality.
- Given the limitations of the Transformer itself, Google recently proposed REFORMER: THE EFFICIENT TRANSFORMER, reducing time and space complexity to $O(L \log L)$; this looks like a promising research direction.
- Distributed training.
- TensorRT to accelerate inference on GPU; bert-as-service for accelerated inference on CPU.
- The vocabulary is very large; for Chinese it is recommended to use the vocabulary of the CLUE benchmark team's pre-trained models, which greatly speeds things up with a smaller vocabulary and no loss of accuracy.
- Was BERT fully trained? RoBERTa argues that BERT was under-trained: simply using more data and training for more steps surpasses XLNet. ALBERT likewise observes that on big data the model does not overfit and performs better after removing dropout.
- Is the training efficient? During pre-training only the 15% of masked tokens contribute to the parameter updates, while the other 85% of tokens do not. The ELECTRA paper finds that learning from 100% of the tokens effectively improves the model.
- Is the position encoding method good? BERT uses learned absolute position encodings. NEZHA, Huawei's Chinese pre-trained model built on BERT, proposes a functional relative position encoding and reports better results than the current BERT approach. Self-Attention with Relative Position Representations also proposes a relative position encoding that outperforms the original Transformer.
- Is the masking scheme good? BERT masks individual WordPieces; Baidu's ERNIE, BERT-WWM, and SpanBERT all show that masking a contiguous span of words works better than masking WordPieces. In addition, RoBERTa uses dynamic masking, regenerating the masks at every training pass. Moreover, [MASK] never appears at fine-tuning time, which is a major issue for the model.
- [MASK] tokens appear in pre-training but not in fine-tuning. XLNet addresses this with autoregression, while ELECTRA generates the replaced tokens with a small generator; the ELECTRA paper, however, reports that this discrepancy has only a small effect.
- Is the NSP loss useful? XLNet, SpanBERT, RoBERTa and ALBERT all find that the NSP loss hurts the model on downstream tasks; ALBERT gives a concrete analysis.
- Is one loss enough? Later work adds its own loss objectives, e.g. ALBERT's SOP; Microsoft's MT-DNN even makes downstream tasks part of a multi-task pre-training objective, and ERNIE 2.0 proposes a variety of training objectives. All of this shows the power of the language model and that many different losses are effective for training.
- It cannot generate natural language (NLG). XLNet and GPT are autoregressive models and can generate text, while BERT's mechanism limits this ability. Recent work shows that with a modified masking mechanism BERT-style models can gain NLG ability and unify NLU and NLG; see UniLM.
2. RoBERTa
The full name of RoBERTa is Robustly optimized BERT approach. The changes RoBERTa makes to BERT are simple: mainly more data and longer training, dynamic [MASK]ing, removing the next-sentence-prediction (NSP) task, a larger batch_size, and a different text encoding.
Final results:
The differences are briefly explained below.
2.1 Dynamic MASK
At every pre-training pass, a fresh 15% of the tokens is selected for [MASK]ing, whereas BERT's masking is static: for the same sample, the masked input is identical across epochs. The experimental results in the figure below show a slight improvement.
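A minimal sketch of the 80/10/10 dynamic masking step (illustrative; mask_token_id, vocab_size and the selection rate are assumptions standing in for the real tokenizer constants):

```python
import random

def dynamic_mask(token_ids, vocab_size, mask_token_id, mask_rate=0.15):
    """Re-sample the masked positions every time this is called (i.e. at every training pass)."""
    input_ids, labels = list(token_ids), [-100] * len(token_ids)   # -100 = ignored by the loss
    num_to_mask = max(1, int(len(token_ids) * mask_rate))
    for pos in random.sample(range(len(token_ids)), num_to_mask):
        labels[pos] = token_ids[pos]          # only selected positions contribute to the MLM loss
        dice = random.random()
        if dice < 0.8:
            input_ids[pos] = mask_token_id                    # 80%: replace with [MASK]
        elif dice < 0.9:
            input_ids[pos] = random.randrange(vocab_size)     # 10%: random token
        # remaining 10%: keep the original token
    return input_ids, labels
```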
2.2 Removing the NSP task
FULL-SENTENCES and DOC-SENTENCES both remove the NSP task, and both perform better than keeping NSP. FULL-SENTENCES may sample sentences across document boundaries, while DOC-SENTENCES ensures the sampled sentences come from the same document; DOC-SENTENCES performs slightly better.
The final RoBERTa therefore removes NSP and samples each training sequence from a single document.
2.3 Larger batch_size
BERT uses a batch_size of 256 and trains for 1M steps in total. Experiments show that a larger batch_size together with more training steps improves performance, so the final RoBERTa uses a batch_size of 8K.
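When an 8K batch does not fit in memory, the usual trick is gradient accumulation. A minimal sketch (PyTorch-style, illustrative, not from the RoBERTa code; it assumes model, optimizer and data_loader are already defined and that the model returns a loss, Hugging Face style):

```python
accumulation_steps = 32          # e.g. 32 micro-batches of 256 ~ an effective batch of 8K sequences
optimizer.zero_grad()
for step, batch in enumerate(data_loader):
    loss = model(**batch).loss / accumulation_steps   # scale so the accumulated gradient matches one big batch
    loss.backward()                                    # gradients add up across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```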
3. ALBERT
The full name of ALBERT is A Lite BERT. It proposes parameter-reduction techniques that make it possible to scale the model up, and it also proposes the SOP training task.
In essence, ALBERT-large performs worse than BERT-large; the versions that perform well are ALBERT-xlarge and ALBERT-xxlarge, and although these have fewer parameters than BERT-large, the model scale has grown, so both training and inference become slower.
So, despite the name, ALBERT is not really a lightweight model.
Because the parameter count shrinks, we can train a larger network. Concretely, ALBERT-xxlarge still has 12 layers but a hidden_size of 4096! With total training time held equal, ALBERT-xxlarge trains at only about 1/3 the speed of BERT-large, noticeably slower, which is the side effect of the larger model scale. But because the model is larger, its performance improves; the ALBERT people use to top leaderboards is actually the xxlarge version, while the plain large version performs worse than BERT-large.
Below are ALBERT's optimizations relative to BERT.
3.1 Reduce parameters
3.1.1 Embedding matrix factorization
This reduces parameters on the input embedding side. BERT uses WordPiece with a vocabulary of roughly 30K tokens, and the original BERT ties the embedding size to the hidden size (both 768), so the embedding matrix has about 30,000 × 768 ≈ 23M parameters. If we factorize the embedding matrix as in the figure above with E = 128, the count becomes 30,000 × 128 + 128 × 768 ≈ 3.9M, roughly 17% of the original!
Think about it: does this factorization hurt the model? The authors' argument: the WordPiece embedding is context-independent, while the hidden-layer representations (the outputs of each Transformer encoder layer) are context-dependent, and BERT's power comes mainly from the attention mechanism, i.e. from representing each token according to its context. Therefore the WordPiece embedding size does not need to be large, because it is not the main source of BERT's strength.
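A sketch of the factorized embedding (PyTorch, illustrative):

```python
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Project tokens into a small space of size E, then up to the hidden size H."""
    def __init__(self, vocab_size=30000, embedding_size=128, hidden_size=768):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_size)          # V x E
        self.projection = nn.Linear(embedding_size, hidden_size, bias=False)     # E x H

    def forward(self, input_ids):
        return self.projection(self.word_embeddings(input_ids))  # (batch, seq_len, H)

# parameter count: V*E + E*H = 30000*128 + 128*768 ≈ 3.9M, versus V*H = 30000*768 ≈ 23M
```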
3.1.2 Parameter sharing
The idea: share a single encoder across BERT's 12 Transformer layers, i.e. the attention layers and the feed-forward layers of all 12 layers use the same parameters; the authors also find this stabilizes the network. Looking at the experiments in the table below, full sharing (both attention and feed-forward layers) is slightly worse than sharing only the attention layers, but full sharing cuts so many parameters that the authors adopt it anyway.
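A sketch of cross-layer parameter sharing (illustrative; encoder_layer stands for any Transformer encoder block taking hidden states and an attention mask):

```python
import torch.nn as nn

class SharedEncoder(nn.Module):
    """One encoder block, applied num_layers times: all layers share the same weights."""
    def __init__(self, encoder_layer: nn.Module, num_layers=12):
        super().__init__()
        self.layer = encoder_layer          # a single set of parameters
        self.num_layers = num_layers

    def forward(self, hidden_states, attention_mask=None):
        for _ in range(self.num_layers):    # reuse the same block instead of stacking 12 copies
            hidden_states = self.layer(hidden_states, attention_mask)
        return hidden_states
```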
3.2 SOP Instead of NSP
Later researchers found that NSP actually hurts BERT, mainly because the task is too easy compared with MLM.
Concretely, NSP can be decomposed into topic prediction and coherence prediction. NSP leans heavily toward topic prediction (predicting whether a sentence pair consists of contiguous segments from the same document), and topic prediction is much simpler than coherence prediction.
SOP builds its negative examples from two segments of the same document in swapped order, which removes the topic-prediction shortcut and forces the model to learn the harder coherence prediction.
3.3 n-gram MASK
Predicting n-gram spans carries more complete semantic information. The length n of each masked span is sampled (at most 3 in the paper); by the formula, 1-grams, 2-grams and 3-grams are chosen with probabilities 6/11, 3/11 and 2/11 respectively, so longer spans are less likely.
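The probabilities come from p(n) being proportional to 1/n; a quick check (illustrative):

```python
def ngram_probs(max_n=3):
    """p(n) proportional to 1/n, normalized over n = 1..max_n."""
    weights = [1.0 / n for n in range(1, max_n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

print(ngram_probs(3))   # [6/11, 3/11, 2/11] ~ [0.545, 0.273, 0.182]
```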
4. ERNIE 1.0
ERNIE 1.0 has the same architecture as BERT; what differs from BERT are the training tasks.
The improvements are briefly described below.
4.1 Knowledge Integration
Concretely, masking is split into three levels:
- Basic-level masking: the same as BERT.
- Entity-level masking: mask whole entities, e.g. the name J.K.Rowling is treated as a single entity and [MASK]ed as a whole.
- Phrase-level masking: mask whole phrases, e.g. a phrase such as "a series of" is [MASK]ed as a whole.
4.2 Dialogue Language Model(DLM)
A task on dialogue data is added, as shown in the figure below. The data is not single-turn question-answer pairs but multi-turn dialogues, which may be QQR, QRQ, and so on. As above, the tokens, entities and phrases inside are [MASK]ed and predicted; in addition, when generating the data, the query or response is replaced with another sentence with some probability, so the model must also predict whether the dialogue pair is genuine. The paper says DLM lets ERNIE learn the implicit relations in dialogue and improves the model's semantic representations.
Note that the segment embedding is replaced by a dialogue embedding, but otherwise the structure is the same as the MLM model, so DLM can be trained jointly with the MLM task.
5. ERNIE 2.0
ERNIE 2.0 has the same structure as ERNIE 1.0 and BERT; it improves results mainly by changing the pre-training tasks. In the years since BERT appeared and became widely used, many other pre-trained models have emerged, and most of them do one thing: 「propose harder and more diverse pre-training tasks to raise the difficulty of learning, so that the model acquires better lexical, syntactic and semantic representations」. ERNIE 2.0 does exactly that, constructing three categories of unsupervised tasks. To train on many tasks it also proposes continual multi-task learning; the overall framework is shown in the figure below.
5.1 Improvement 1: continual multi-task learning
Let the model learn 3 tasks at the same time (the now-popular joint training). There are three possible strategies:
- Strategy 1, multi-task learning: learn the 3 tasks simultaneously by adding up the 3 loss functions (with weights) and backpropagating them together;
- Strategy 2: train task 1 first, then task 2, then task 3. The drawback is that earlier tasks are easily forgotten; the final model tends to overfit task 3;
- Strategy 3, continual multi-task learning: in the first round train task 1 but do not train it to completion; in the second round train tasks 1 and 2 together, again without full convergence; in the third round train all three tasks together until the model converges.
The paper adopts strategy 3.
Concretely, as shown in the figure below, each task has its own loss function, and sentence-level tasks can be trained together with token-level tasks.
5.2 Improvement 2: more unsupervised pre-training tasks
The model structure is shown in the figure below; because of multi-task training, an extra task embedding is added to the model input.
What are the three categories of unsupervised training tasks, and which tasks does each contain?
- Category 1: word-level pre-training tasks
  - Knowledge Masking Task: same as ERNIE 1.0: mask some words, phrases and entities and predict the [MASK]ed content.
  - Capitalization Prediction Task: predict whether a word is capitalized; capitalized words matter for entity recognition, lowercase forms for other tasks.
  - Token-Document Relation Prediction Task: predict whether a token appearing in segment A of a document also appears in its other segments.
- Category 2: structure-level pre-training tasks
  - Sentence Reordering Task: shuffle the sentences of a document and recover the correct order.
  - Sentence Distance Task: classify the distance between two sentences (0: adjacent sentences, 1: non-adjacent sentences in the same document, 2: sentences from different documents).
- Category 3: semantic-level pre-training tasks
  - Discourse Relation Task: predict the semantic or rhetorical relation between two sentences.
  - IR Relevance Task: short-text relevance from search data (0: query and clicked result, 1: query and shown-but-not-clicked result, 2: irrelevant).
All of these are 「unsupervised pre-training tasks」!
6. ELECTRA
ELECTRA is one of the more innovative recent models; both the architecture and the pre-training task differ from BERT. ELECTRA stands for Efficiently Learning an Encoder that Classifies Token Replacements Accurately. The paper opens by pointing out a drawback of BERT training: 「learning is too inefficient」, because the model only learns from the 15% masked tokens of each sample. The authors therefore propose a new architecture that lets the model learn from all input tokens, not just the [MASK]ed ones, making learning more efficient. They claim that with the same data ELECTRA needs fewer steps to match BERT, RoBERTa and XLNet, and surpasses them when trained for the same number of steps.
「But looking at the Chinese ELECTRA models released by Harbin Institute of Technology, they are not clearly better than BERT and even perform worse on some Chinese tasks, so there is still plenty of debate about this model.」
6.1 New architecture: generator + discriminator
ELECTRA's structure is simple: a generator plus a discriminator, as shown in the figure below. First, some tokens of the sentence are randomly [MASK]ed, and a generator is trained to predict the [MASK]ed tokens (the generator usually does not need to be large; the experiments below explain why). The generator's predictions produce a new sentence, which is fed to the discriminator; the discriminator then judges, for every token, whether it is the original token or a replacement.
Note that if the generator happens to predict the original token, that token is still labeled as original rather than replaced in the discriminator's targets (for the in the figure: the generator predicts the, so the true label of the in the discriminator is original, not replaced). That is the whole idea of the architecture; simple, isn't it?
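A sketch of how the discriminator's replaced-token-detection labels could be built (PyTorch, illustrative, not the official implementation):

```python
import torch

def rtd_labels(original_ids, masked_positions, generator_sample_ids):
    """original_ids: (batch, seq_len) the clean input.
    masked_positions: boolean tensor marking the positions that were [MASK]ed.
    generator_sample_ids: (batch, seq_len) tokens sampled from the generator at those positions."""
    corrupted = torch.where(masked_positions, generator_sample_ids, original_ids)
    # label 1 = replaced, 0 = original; a lucky correct guess by the generator still counts as original
    labels = (corrupted != original_ids).long()
    return corrupted, labels
```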
6.2 Weight sharing
If the generator and the discriminator use the same architecture, the two models can share all weights; if not, they can still share the embedding layer. The authors therefore run experiments on three settings:
- generator and discriminator are completely independent, nothing shared;
- generator and discriminator share the embedding parameters, with the generator's input and output embeddings also tied (why is this possible? because the generator ends with a classification over the whole vocabulary, so its output layer can match the dimensions of the input embedding matrix, while the discriminator ends with a binary classification and thus cannot be tied to the input embedding matrix); other parameters are not shared;
- generator and discriminator share all parameters.
Setting 1 reaches a GLUE score of 83.6, setting 2 reaches 84.3, and setting 3 reaches 84.4. So sharing parameters clearly helps. The reason the authors give: without sharing, the discriminator only updates the embeddings of tokens it actually sees, while the generator updates the weights of the whole vocabulary (think about why: the generator ends with a classification over the entire vocabulary). 「So sharing the embeddings is definitely worthwhile.」 As for why setting 2 is adopted rather than setting 3: full sharing forces the generator and the discriminator to have the same model structure, which badly hurts training efficiency.
7. XLNet
First, two kinds of unsupervised objective functions:
- AR (autoregressive): model the sequence in order, predicting x_t from x_0..x_{t-1}. Traditional unidirectional language models (ELMo, GPT) all use AR objectives.
- AE (autoencoding): reconstruct the input from a corrupted copy. BERT's MLM is a form of AE.
AR is the classic approach, but it cannot encode bidirectionally, which is why BERT chose AE to capture global information about the sequence. The XLNet authors, however, point out two problems caused by BERT's AE objective:
- BERT makes an assumption that does not hold in reality: the masked tokens are conditionally independent of each other. For example, with a pre-training input such as 自然【MASK】【MASK】处理 ("natural language processing" with the two characters of 语言 masked), BERT's objective is effectively p(语 | 自然处理) + p(言 | 自然处理), whereas an AR factorization would give p(语 | 自然) + p(言 | 自然语). The distribution BERT learns is therefore built on this independence assumption and ignores the dependencies among the masked tokens.
- BERT's pre-training and fine-tuning inputs differ: most pre-training inputs contain [MASK], which introduces noise; even though a small fraction of the selected positions are filled with other tokens, there is still a gap from real data.
Those are the pain points of BERT's AE objective; next, let's see how XLNet addresses them.
To be fair, the final experiments are not a fully equal comparison with BERT. But new ideas are hard to come by:
- Permutation Language Modeling: it first unifies the conceptual framework of earlier language models (AR vs AE), then uses permutations to combine the advantages of both while returning to an overall AR framework; it felt like the SOTA of a new generation of generative models was just around the corner.
- Transformer-XL + relative segment encoding: not what the authors emphasize, but I find it very useful. Short-text tasks are manageable today, but once the text grows to paragraph or even document length things get much harder; these two tricks make me see the possibility of bigger gains in NLU on long texts.
7.1 PLM Permutation Language Modeling
Rather than saying XLNet solves BERT's problems, it is more accurate to say that it achieves bidirectional encoding with a new AR-based method, since the two pain points above simply do not exist for AR.
- In theory
For a sequence x of length T there are T! possible orderings. If x1, x2, x3, x4 is rearranged into x2, x1, x4, x3 and an AR objective is applied to that order, the likelihood to optimize is p(x2) p(x1|x2) p(x4|x2,x1) p(x3|x2,x1,x4); in general XLNet maximizes $\max_\theta \; \mathbb{E}_{z \sim \mathcal{Z}_T} \big[ \sum_{t=1}^{T} \log p_\theta(x_{z_t} \mid x_{z_{<t}}) \big]$ over sampled permutations z.
Because the model parameters are shared across all orderings, the model eventually learns to gather information from every position.
- In practice
Because of the computational cost it is impossible to enumerate all orderings, so only one permutation is sampled for each input sequence. Moreover, the sequence itself is never actually shuffled during training; the permutation is realized through an attention mask matrix (a small sketch follows). The authors emphasize that this keeps the input consistent with fine-tuning, so there is no pretrain-finetune discrepancy.
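A sketch of how a sampled permutation can be turned into an attention mask (illustrative; 1 means "may attend", and this corresponds to the "cannot see the current token" constraint used for prediction, while the content stream additionally allows a position to attend to itself):

```python
import numpy as np

def permutation_mask(perm):
    """perm: a permutation of positions, e.g. [1, 0, 3, 2] meaning x2, x1, x4, x3 (0-based).
    Returns mask[i][j] = 1 if position i may attend to position j,
    i.e. j comes before i in the factorization order."""
    order = {pos: rank for rank, pos in enumerate(perm)}
    n = len(perm)
    mask = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            if order[j] < order[i]:   # j precedes i in the permutation
                mask[i, j] = 1
    return mask

print(permutation_mask([1, 0, 3, 2]))
```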
7.1.1 Two-Stream Self-Attention
With the core idea settled, the next issue is implementation. Shuffling the order creates a problem: when predicting the third element, say P(x4 | x2, x1), if the permutation were instead x2, x1, x3, x4 the model should predict P(x3 | x2, x1); but the model has no way of knowing which token it is supposed to predict, so the two outputs would be identical, which is wrong. We therefore have to add position information, i.e. P(x4 | x2, x1, 4) and P(x3 | x2, x1, 3), so that the model knows which position it is currently predicting.
That raises the next issue: conventional attention only encodes the token, with the position information baked into the same encoding, while the AR objective forbids the model from seeing the current token's content. So the position embedding has to be separated out. How? The authors propose Two-Stream Self-Attention.
- Query stream: can see the current position information, but not the current token's content.
- Content stream: traditional self-attention that, like GPT, encodes the current token.
During pre-training only the query stream is used for the final prediction, because the content stream has already seen the current token. During fine-tuning only the content stream is used, returning to the traditional self-attention structure.
I stared at the figure below at least three times before it clicked; the figure explains it more clearly than I can.
Also, unlike MLM, which only predicts a subset of tokens, XLNet has to handle the permutation machinery and is even more expensive, so the authors propose partial prediction to simplify things: only the last 1/K of the tokens (in factorization order) are predicted.
7.2 Borrowing from Transformer-XL
7.2.1 Segment-level recurrence with state reuse
Limited by memory and compute, long texts currently have to be truncated; BERT's length is 512, so it cannot handle long documents directly. Borrowing the idea of the hidden memory unit in RNNs, Transformer-XL caches the previous segment's states so that the following segment can read information from it. Reading the previous 4 tokens' information, for example, looks like the figure below:
From figure (b) you can see that the token in the top right corner sees more information than plain truncation would allow. It is a bit like the receptive field in a CNN, which grows with depth.
Conceptually, the "hidden state" passes information from one segment to the next, but the implementation hits a problem: absolute positional embeddings. Both segments number their own tokens 1, 2, 3, 4, ..., so the model cannot tell whether a given position belongs to segment 1 or segment 2. The authors therefore propose relative positional encoding, replacing absolute positions with the relative distance between two tokens. See the paper for details; roughly, the position-related terms in the attention-weight computation are rewritten.
7.2.2 Multiple Segments modeling
BERT has the extra Next Sentence Prediction objective, which helps it adapt directly to various downstream tasks at fine-tuning time. XLNet could use the same structure, but the study's final conclusion is that NSP does not help it.
XLNet proposes Relative Segment Encoding: instead of assigning sentences A and B fixed segment embeddings as BERT does, it borrows the relative-position idea and only judges whether two tokens are in the same segment, rather than which segment each belongs to.
Concretely, when computing the attention weight, an extra operation is applied to the query to compute an additional weight that is added to the original one, much like relative positional encoding.
The benefit is that more segments can be handled later on, instead of only two as in BERT.
8. T5
T5 was trained on a 750GB corpus. Its core contribution is to unify NLU and NLG, i.e. natural language understanding and natural language generation. Although the work is not new, the T5 experiments are very thorough and conclude that encoder/decoder is a very good structure; the largest T5 reaches 11 billion parameters. The figure on the right shows T5's pre-training objective: some spans of the input are masked and fed to the encoder, and the decoder has to generate the masked spans.
For downstream tasks, the text goes in on the encoder side and the decoder produces the label as text. In fact, whether the task is NLU or NLG, the correct answer can be expressed as text, so classification, translation, and even regression tasks can all use the same seq2seq model structure and the same training / inference strategy.
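A sketch of the text-to-text formatting idea (illustrative; the task prefixes below follow the style used in the T5 paper, but treat the exact strings as assumptions):

```python
# Every task becomes "input text -> target text"; one seq2seq model handles them all.
examples = [
    # classification: the label is emitted as a word
    ("sst2 sentence: this movie was a delight", "positive"),
    # translation: the prefix names the language pair
    ("translate English to German: The house is wonderful.", "Das Haus ist wunderbar."),
    # regression (e.g. STS-B): the score is emitted as a string
    ("stsb sentence1: A man is playing a guitar. sentence2: A man plays guitar.", "4.8"),
]
for source, target in examples:
    print(f"encoder input: {source!r}\ndecoder target: {target!r}\n")
```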
9. Summary
Pre-training brings the following advantages:
- By using large amounts of unlabeled text, pre-training helps the model learn universal language representations.
- With only one or two task-specific layers added, the pre-trained model adapts to downstream tasks; this provides a good initialization and avoids training the downstream model from scratch (only the task-specific layers are trained).
- It lets the model perform well with only a small dataset, reducing the need for large numbers of labeled examples.
- Deep models have many parameters and easily overfit when trained on small datasets; pre-training provides a good initialization that avoids overfitting on small datasets, so it can be regarded as a form of regularization.
9.1 Pre-training steps
Pre-training a model involves the following five steps:
- Prepare the pre-training corpus
- Build the vocabulary
- Design the pre-training tasks
- Choose the pre-training method
- Choose the pre-training dynamics
9.2 Pre-training tasks
- Causal language modeling (CLM)
- Masked language modeling (MLM)
- Replaced token detection (RTD)
- Shuffled token detection (STD)
- Random token substitution (RTS)
- Swapped language modeling (SLM)
- Translation language modeling (TLM)
- Alternate language modeling (ALM)
- Sentence boundary objective (SBO)
- Next sentence prediction (NSP)
- Sentence order prediction (SOP)
- Sequence-to-sequence language modeling (Seq2SeqLM)
- Denoising autoencoding (DAE)
Ref
- https://mp.weixin.qq.com/s/jE8IQARcztl8p9DMtlCy0A
- https://cloud.tencent.com/developer/article/1555590
- https://cloud.tencent.com/developer/article/1855316 (BERT text classification practice)
- https://cloud.tencent.com/developer/article/1780054
- https://cloud.tencent.com/developer/article/1789826 (development of word2vec, GPT, ELMo)
- https://wmathor.com/index.php/archives/1456/
- Hung-yi Lee's machine learning course, 2021
- https://blog.csdn.net/yizhen_nlp/article/details/106560907 (BERT fine-tuning strategies)
- https://zhuanlan.zhihu.com/p/132554155 (BERT / Transformer knowledge points)
- https://blog.csdn.net/u011412768/article/details/108015783 and https://plmsmile.github.io/2018/12/15/52-bert/#%E5%9F%BA%E4%BA%8Efinetune (BERT explained)
- https://cloud.tencent.com/developer/article/1702065 (BERT code walkthrough)
- https://zhuanlan.zhihu.com/p/154527264 (analysis of BERT-based pre-trained models)
- https://towardsdatascience.com/masked-language-modelling-with-bert-7d49793e5d2c (MLM explained)
- https://zhuanlan.zhihu.com/p/70218096 (XLNet)
- https://zhuanlan.zhihu.com/p/409867119 (Tencent pre-trained models)
- https://mp.weixin.qq.com/s/EZciiZEVCn45Hm1Cqz8PqQ (survey of pre-trained models)