NLP Interview Notes
2022-06-24 04:53:00 【Goose】
0. Basics
0.1 Language foundation
Difference between yield and return in Python
A function containing yield is a generator, not an ordinary function; it can only be iterated over once, and its results are produced one at a time instead of all being held in memory.
Generators also make it convenient to control access to resources. A small example follows.
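A minimal sketch (not from the original post) contrasting return and yield:

```python
def squares_list(n):
    # return: the whole list is built in memory before being handed back
    return [i * i for i in range(n)]

def squares_gen(n):
    # yield: values are produced lazily, one at a time
    for i in range(n):
        yield i * i

gen = squares_gen(3)
print(list(gen))   # [0, 1, 4]
print(list(gen))   # [] -- a generator is exhausted after a single pass
```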
Python decorators
A decorator is essentially a higher-order function (a function of a function), used to modify or extend the behavior of a function; typical uses include authentication and logging. A sketch follows.
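As an illustration, a hedged sketch of a timing/logging decorator (the names log_time and slow_add are made up for this example):

```python
import functools
import time

def log_time(func):
    """Log how long the wrapped function takes; the wrapped function itself is unchanged."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.time() - start:.4f}s")
        return result
    return wrapper

@log_time
def slow_add(a, b):
    time.sleep(0.1)
    return a + b

slow_add(1, 2)  # prints something like: slow_add took 0.1001s
```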
Python multithreading and multiprocessing
Multithreading can only use one core at a time, so it is "fake" multithreading for CPU-bound work.
Python's reference interpreter, CPython, has a GIL.
Python code is executed by the Python virtual machine (the interpreter). Python was designed so that in the main loop only one thread executes at a time, just like a single-CPU system that keeps many programs in memory but runs only one of them on the CPU at any moment. Likewise, although the Python interpreter can run multiple threads, only one thread is active inside the interpreter at a time. Access to the virtual machine is controlled by the global interpreter lock (GIL), and it is this lock that guarantees only one thread runs at any given moment.
In a multithreaded environment, the Python virtual machine executes as follows: 1. acquire the GIL; 2. switch to a thread; 3. run it; 4. put the thread to sleep; 5. release the GIL; 6. repeat these steps.
Multiprocessing
https://www.liaoxuefeng.com/wiki/1016959663602400/1017628290184064
from multiprocessing import Pool
Inter-process communication goes through Queue or Pipe.
Taking Queue as an example, create two child processes in the parent process: one writes data to the Queue and the other reads data from it; a minimal sketch follows.
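A sketch of that Queue example, close to the one in the linked tutorial but rewritten from memory, so treat it as illustrative:

```python
from multiprocessing import Process, Queue
import time

def write(q):
    # child process 1: write data into the Queue
    for value in ['A', 'B', 'C']:
        print('Put %s to queue' % value)
        q.put(value)
        time.sleep(0.5)

def read(q):
    # child process 2: read data from the Queue
    while True:
        value = q.get(True)
        print('Get %s from queue' % value)

if __name__ == '__main__':
    q = Queue()
    pw = Process(target=write, args=(q,))
    pr = Process(target=read, args=(q,))
    pw.start()
    pr.start()
    pw.join()        # wait for the writer to finish
    pr.terminate()   # the reader loops forever, so terminate it
```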
Thread safety
Variables shared between threads, including global variables, are not safe in a multithreaded environment.
The synchronization mechanisms in threading are needed.
from threading import Thread, Lock
import time

num = 0
mutex = Lock()

def add_num():
    global num
    for i in range(100000):
        mutex.acquire()
        num += 1
        mutex.release()

Read-write lock
Multiple threads can hold a read-write lock in read mode at the same time, but only one thread can hold it in write mode.

import threading

class RWlock(object):
    def __init__(self):
        self._lock = threading.Lock()
        self._extra = threading.Lock()
        self.read_num = 0

    def read_acquire(self):
        with self._extra:
            self.read_num += 1
            if self.read_num == 1:
                self._lock.acquire()

    def read_release(self):
        with self._extra:
            self.read_num -= 1
            if self.read_num == 0:
                self._lock.release()

    def write_acquire(self):
        self._lock.acquire()

    def write_release(self):
        self._lock.release()

Difference between map() in Python 2 and Python 3
map() is a built-in higher-order function in Python. It takes a function f and a list, applies f to each element of the list in turn, and (in Python 2) returns a new list.
def f(x):
    return x * x

print map(f, [1, 2, 3, 4, 5, 6, 7, 8, 9])   # Python 2

Execution result: [1, 4, 9, 16, 25, 36, 49, 64, 81]
But in Python 3 the return value is a map object (a lazy iterator), not a list.
Now we just need to change print(map(f, [1, 2, 3, 4])) into print(list(map(f, [1, 2, 3, 4]))).
This is because in Python 3, map() takes a function f and a list, applies f to each element in turn, and returns a map object rather than a list.
So a direct cast to list is all that is needed.
In Python 2, map()'s func can be None, as in map(None, seq1, seq2, ..., seqn); this zips the elements at corresponding indexes of the sequences into tuples and finally returns a list of tuples. In Python 3, if func is not specified, the returned map object raises a TypeError when it is consumed.
0.2 Computer fundamentals
Threads
A thread is the smallest schedulable unit of execution in an operating system.
Synchronization and mutual exclusion
Synchronization means the program runs in a predetermined order.
Thread synchronization guarantees that at any moment the shared data is accessed by at most one thread, which keeps the data correct.
A mutex introduces a state for a resource: locked / unlocked.
When a thread wants to change shared data, it locks the resource first, setting its state to "locked" so other threads cannot change it. Only after the thread releases the resource and sets its state back to "unlocked" can another thread lock it again. A mutex guarantees that only one thread operates at a time, which keeps the data correct under multithreading.
Deadlock prevention:
- Use a reentrant lock (RLock): it maintains a counter plus a mutex internally, so the same thread can acquire it repeatedly (the counter increments) without blocking itself
- Avoid holding multiple locks at once
- Always acquire locks in a consistent order
- Acquire with a timeout so the lock is released automatically
Copying
Direct assignment: just another reference to (alias of) the same object.
Shallow copy (copy.copy): copies the parent object but not the objects nested inside it (the children are still shared).
Deep copy (copy.deepcopy): the deepcopy method of the copy module; it fully copies the parent object and all of its children. A small example follows.
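A short example showing the three behaviors:

```python
import copy

a = [1, [2, 3]]

b = a                 # direct assignment: b is just another name for the same object
c = copy.copy(a)      # shallow copy: new outer list, inner list still shared
d = copy.deepcopy(a)  # deep copy: everything is duplicated

a[1].append(4)
print(b[1])  # [2, 3, 4] -- same object as a
print(c[1])  # [2, 3, 4] -- the nested list is still shared
print(d[1])  # [2, 3]    -- fully independent
```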
0.3 Data structures
Quicksort
def quick_sort(array, l, r):
    if l < r:
        q = partition(array, l, r)
        quick_sort(array, l, q - 1)
        quick_sort(array, q + 1, r)

def partition(array, l, r):
    x = array[r]
    i = l - 1
    for j in range(l, r):
        if array[j] <= x:
            i += 1
            array[i], array[j] = array[j], array[i]
    array[i + 1], array[r] = array[r], array[i + 1]
    return i + 1

Binary search
# Return the index of x in arr; return -1 if it is not present
def binarySearch(arr, l, r, x):
    # Base check
    if r >= l:
        mid = int(l + (r - l) / 2)
        # x is exactly at the middle position
        if arr[mid] == x:
            return mid
        # x is smaller than the middle element, so search the left half
        elif arr[mid] > x:
            return binarySearch(arr, l, mid - 1, x)
        # x is larger than the middle element, so search the right half
        else:
            return binarySearch(arr, mid + 1, r, x)
    else:
        # Not present
        return -1

Tree traversal
Pre-order, in-order, and post-order traversal (the names refer to where the root is visited relative to its left and right subtrees); a recursive sketch follows.
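A recursive sketch of the three traversals (TreeNode is a minimal node class defined here for the example):

```python
class TreeNode:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def preorder(root):
    # root, left, right
    if not root:
        return []
    return [root.val] + preorder(root.left) + preorder(root.right)

def inorder(root):
    # left, root, right
    if not root:
        return []
    return inorder(root.left) + [root.val] + inorder(root.right)

def postorder(root):
    # left, right, root
    if not root:
        return []
    return postorder(root.left) + postorder(root.right) + [root.val]

root = TreeNode(1, TreeNode(2), TreeNode(3))
print(preorder(root), inorder(root), postorder(root))  # [1, 2, 3] [2, 1, 3] [2, 3, 1]
```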
1. Fundamentals of machine learning
1.1 Basic model
Activation function summary
https://zhuanlan.zhihu.com/p/73214810
sigmoid
- The sigmoid function smoothly maps the real line into (0, 1).
- Its value can be interpreted as the probability of the positive class (probabilities range over 0~1), centered at 0.5.
- It is monotonically increasing and continuously differentiable, with a very simple derivative, so it is a convenient function.
Advantages: smooth, easy to differentiate
Disadvantages:
- The activation is computationally expensive (both the forward and the backward pass involve exponentiation and division);
- Computing the error gradient during backpropagation involves division;
- The derivative of sigmoid lies in (0, 0.25], and because of the "chain reaction" of backpropagation the gradient easily vanishes. For a 10-layer network, 0.25^10 is tiny, so the gradient of the 10th layer's error with respect to the first convolution layer's weights W1 will be an extremely small value; this is the so-called "vanishing gradient".
- The output of sigmoid is not zero-mean (not zero-centered), so the neurons of the next layer receive non-zero-mean signals as input; as the network deepens this shifts the original distribution of the data.
Derivation: https://zhuanlan.zhihu.com/p/24967776 (a quick numerical check of the derivative follows)
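A quick numerical check of sigmoid and its derivative σ'(x) = σ(x)(1 − σ(x)), whose maximum value 0.25 is reached at x = 0:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

print(sigmoid_grad(0.0))  # 0.25, the maximum of the derivative
print(0.25 ** 10)         # ~9.5e-07: why a deep sigmoid network suffers vanishing gradients
```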
tanh
tanh is the hyperbolic tangent function (Hyperbolic Tangent). It is similar to sigmoid; both are saturating activation functions, but its output range changes from (0, 1) to (-1, 1). tanh can be seen as the result of shifting and stretching sigmoid.
Characteristics of tanh as an activation function, compared with sigmoid:
- tanh's output range is (-1, 1), which fixes sigmoid's non-zero-centered output problem;
- the exponentiation cost still exists;
- tanh's derivative lies in (0, 1), compared with sigmoid's (0, 0.25), so the vanishing-gradient problem is alleviated but still present.
In a DNN, tanh is often used in the earlier layers and sigmoid in the last layer.
relu
ReLU (Rectified Linear Unit), the rectified linear function, has a fairly simple form.
Formula: relu(x) = max(0, x)
Characteristics of ReLU as an activation function:
- Compared with sigmoid and tanh, ReLU drops the expensive computation and is faster to evaluate.
- It alleviates the vanishing-gradient problem and converges faster than sigmoid and tanh, but beware of exploding gradients with ReLU.
- It makes good models easier to obtain, but the model must be kept from entering a 'dead' state during training.
ReLU forces the output to 0 for x < 0 (setting it to 0 masks the feature), which may prevent the model from learning effective features. If the learning rate is set too large, most of the network's neurons may end up in a 'dead' state, so networks that use ReLU should not use too large a learning rate.
In Leaky ReLU the negative slope in the formula is a constant, usually set to 0.01. It often works slightly better than ReLU, but the effect is not very stable, so Leaky ReLU is not used much in practice.
PReLU (parametric rectified linear unit) treats the negative slope as a learnable parameter that is updated during training.
RReLU (randomized rectified linear unit) is another variant of Leaky ReLU. In RReLU the negative slope is random during training and becomes fixed at test time; its highlight is that during training a_ji is a random number drawn from a uniform distribution U(l, u). (A numpy sketch of these activations follows.)
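A numpy sketch of the activations discussed above (the α values are the typical defaults mentioned in the text, and in a real model PReLU's α would be a trained parameter):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):       # fixed small negative slope
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):                 # alpha is learned during training in a real model
    return np.where(x > 0, x, alpha * x)

def elu(x, a=1.0):
    return np.where(x > 0, x, a * (np.exp(x) - 1))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), leaky_relu(x), elu(x))
```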
RNN seq2seq LSTM GRU
https://cloud.tencent.com/developer/article/1869960
Why does an RNN use tanh rather than ReLU as its activation function?
https://www.zhihu.com/question/61265076
ReLU can also lead to vanishing gradients in an RNN; directly replacing the activation function with ReLU can produce very large output values (the same weight matrix is applied repeatedly across time steps), and replacing the activation with ReLU still does not solve the problem of passing gradients over long distances.
Attention
Types of attention
Soft/Hard Attention
soft attention: the traditional attention; it can be embedded in the model and trained with backpropagation
hard attention: does not compute over all outputs; it samples from the encoder outputs according to a probability, and during backpropagation the gradient is estimated with Monte Carlo methods
Global/Local Attention
global attention: the traditional attention, computed over all encoder outputs
local attention: between soft and hard; it predicts a position and computes attention only within a window around it
Self Attention
Traditional attention computes the dependencies between Q and K, whereas self-attention computes the dependencies within Q and within K themselves.
beam search
https://blog.csdn.net/weixin_38937984/article/details/102492050
Parameter calculation in neural network
https://zhuanlan.zhihu.com/p/33223290
XGBoost Principle and derivation
https://cloud.tencent.com/developer/article/1835890
XGBoost Feature importance ranking
https://zhuanlan.zhihu.com/p/355884348
https://www.jianshu.com/p/2920c97e9e16
- weight
xgb.plot_importance is the function commonly used to plot feature importance; the contribution measure behind it is weight.
'weight' - the number of times a feature is used to split the data across all trees.
Simply put, it is how many times the feature is used for splitting, counted over all trees. In the R package this metric is also called frequency.
weight gives higher values to numerical features, because they take more distinct values and therefore offer a larger space of possible split points, so this metric tends to mask important categorical (enumeration) features.
- gain
model.feature_importances_ is the default method used when we ask for feature importance values; the contribution measure behind it is gain.
'gain' - the average gain across all splits the feature is used in.
gain is a generalization of information gain. It is the average improvement of the objective function (information gain) brought by the feature whenever a node splits on it.
gain uses the idea of entropy reduction, so it easily finds the most directly predictive features: if the samples falling under one branch of a feature's split all have label 0, that feature will certainly rank near the top.
- cover
model = XGBRFClassifier(importance_type='cover'): this measure has to be specified when the model is defined; afterwards model.feature_importances_ returns contributions based on cover.
'cover' - the average coverage across all splits the feature is used in.
Intuitively, cover is the number of samples covered by the nodes that split on the feature, divided by the number of times the feature is used for splitting. The closer the splits are to the root, the larger the cover value.
cover is friendlier to categorical (enumeration) features. It also does not overfit to the objective function and is not affected by the objective's scale. (A short usage sketch follows.)
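A hedged usage sketch with the xgboost sklearn API (the dataset and hyperparameters are placeholders, and the default importance_type behind feature_importances_ has varied between xgboost versions):

```python
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# importance_type can be 'weight', 'gain' or 'cover'
model = xgb.XGBClassifier(n_estimators=50, importance_type='cover')
model.fit(X, y)
print(model.feature_importances_)      # cover-based contributions

# all three measures can also be read from the underlying booster
booster = model.get_booster()
for t in ('weight', 'gain', 'cover'):
    print(t, booster.get_score(importance_type=t))

# xgb.plot_importance(model)  # plots 'weight' by default
```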
1.2 Training and tuning
Problems when batch_size is too large or too small
https://www.zhihu.com/question/32673260
https://zhuanlan.zhihu.com/p/86529347
If the batch size is too small, training takes more time and the gradient oscillates severely, which is bad for convergence;
if the batch size is too large, the gradient direction barely changes between different batches, and training easily gets stuck in a poor local minimum.
How to set the learning_rate
Too small: slow training, prone to overfitting
Too large: oscillation
Typically 0.01 - 0.001, decayed gradually, by a factor of around 100 over the course of training.
Differences between L0, L1, and L2 regularization
https://blog.csdn.net/zouxy09/article/details/24971995
https://www.zhihu.com/question/26485586
L0 norm (the number of non-zero parameters): makes the coefficients sparse but is hard to optimize
L1 (lasso): removes features with little effect and can therefore be used for feature selection; harder to optimize (the gradient is constant); many feature weights become exactly 0
L2 (ridge): shrinks feature weights evenly, prevents overfitting and improves generalization; many feature weights converge toward 0 without reaching it
Optimizer selection
https://cloud.tencent.com/developer/article/1872633
batch normalize
https://www.cnblogs.com/shine-lee/p/11989612.html
Formula and principle
Let the input of a hidden layer be A^{i-1}. Similar to formula (2), we first compute the intermediate value
Z^i = W^i * A^{i-1} + b^i    (3)
Before computing the activation A^i, BN proceeds as follows.
Suppose the current mini-batch has m samples, giving m intermediate values Z^{i(1)}, Z^{i(2)}, Z^{i(3)}, ..., Z^{i(m)}.
① First compute the mean and variance of the current intermediate values:
mean: μ = (1/m) * Σ_{j=1..m} Z^{i(j)}    variance: δ² = (1/m) * Σ_{j=1..m} (Z^{i(j)} − μ)²
② Then normalize, similar to formula (1):
Z_norm^{i(j)} = (Z^{i(j)} − μ) / sqrt(δ² + ϵ)    (4)
Compared with formula (1), the slight difference is that sqrt(δ² + ϵ) replaces δ, to avoid a zero denominator.
At this point Z_norm^i has mean 0 and variance 1, but we do not want every layer's distribution to stay identical forever; after all, we want the hidden layers to learn enough, and the data's own distribution is part of what must be learned, so we go one step further.
③ Scale and shift appropriately:
bn^i = γ * Z_norm^i + β    (5)
The scale parameter γ and shift parameter β change the hidden layer's mean to β and its standard deviation to γ. These two parameters take part in training: like the weight matrix W they are updated by gradient descent and change gradually as the model iterates.
④ Finally apply the activation function:
A^i = g(bn^i)    (6)
Throughout this process, the distribution of each layer's activations is kept close to a normal distribution, with a shape like figure ② above, acting as a regular round "pit" for optimization; it also keeps each layer from losing, after normalization, too much of what it should have learned. (A numpy sketch of this forward pass follows.)
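A numpy sketch of this forward pass for one mini-batch, following equations (3)-(6) above (γ and β are initialized to 1 and 0 here just for illustration):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-5):
    # Z: (m, hidden_dim) intermediate values of a mini-batch of m samples
    mu = Z.mean(axis=0)                       # step 1: batch mean
    var = Z.var(axis=0)                       # step 1: batch variance
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # step 2: normalize
    return gamma * Z_norm + beta              # step 3: scale and shift

m, h = 32, 4
Z = np.random.randn(m, h) * 3 + 7             # intermediate values with arbitrary mean/std
gamma, beta = np.ones(h), np.zeros(h)
bn = batch_norm_forward(Z, gamma, beta)
A = np.maximum(0, bn)                         # step 4: activation, e.g. ReLU
print(bn.mean(axis=0).round(3), bn.std(axis=0).round(3))  # per-feature mean ~0, std ~1
```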
In a nutshell
BN plays a role similar to normalization: it changes the distribution of the data so that the loss surface looks closer to a round pit, as in figure ② above, which allows a larger learning rate and accelerates the network's convergence.
Because of this benefit, I basically add BN to any deep learning model I build; it makes hyperparameter tuning much more forgiving and lets me set a large learning rate.
Using Batch Normalization brings the following benefits:
- A larger learning rate can be used; training is more stable and much faster.
- The bias can be set to 0, because the standardization step of Batch Normalization removes the DC component, so a bias is no longer needed.
- Less sensitivity to weight initialization: weights are usually sampled from a zero-mean Gaussian, and the choice of its variance used to matter a lot; with Batch Normalization, if the weights feeding the same output node are scaled, the standard deviation σ scales by the same amount and the effect cancels out.
- Less sensitivity to the scale of the weights, for the same reason: the scale is uniformly controlled by the γ parameter and determined during training.
- Deep networks can use sigmoid and tanh again, for the same reason: BN suppresses vanishing gradients.
- Batch Normalization has a certain regularization effect, so there is less need for dropout, reducing overfitting.
Vanishing gradients, exploding gradients, and the CEC mechanism
https://www.jianshu.com/p/3f35e555d5ba
https://zhuanlan.zhihu.com/p/25631496
https://zhuanlan.zhihu.com/p/51490163
Consider the pictured network with 3 hidden layers. When vanishing gradients occur, the weights of hidden layer 3 (closest to the output) still update normally, but the weights of hidden layer 1 in front update very slowly, so they barely change and stay close to their initial values. Hidden layer 1 then degenerates into a mere mapping layer that applies the same mapping to every input: learning in the deep network becomes equivalent to learning only in the last few layers of a shallow network.
If sigmoid is used as the activation function, its values and derivatives are as shown in the linked figures.
Solutions:
- Exploding gradients:
If the model parameters are not numbers in (-1, 1) but, say, around 50, then differentiating with respect to w1 multiplies many large numbers together; parameter updates break down and the network cannot learn.
Solution: initialize the model parameters sensibly.
- Vanishing gradients:
- Use ReLU, Leaky ReLU, ELU, and similar activation functions
ReLU: the idea is very simple: if the derivative of the activation function is 1, the activation no longer causes vanishing or exploding gradients, and every layer of the network gets the same update speed.
f(x) = max(0, x)
Advantages of ReLU:
* Alleviates the vanishing/exploding gradient problem
* Easy and fast to compute
* Speeds up network training
Disadvantages of ReLU:
* Because the negative part is always 0, some neurons may never activate (partly mitigated by using a small learning rate)
* The output is not zero-centered
Despite these shortcomings, ReLU is still the most widely used activation function.
Leaky ReLU: f(x) = max(α*x, x) with α = 0.01, which removes the dead zone for x < 0.
The ELU activation is also designed to fix ReLU's dead zone for x < 0; its expression is:
f(x) = x, (x > 0)
f(x) = a(e^x - 1), (otherwise)
- Pre-training
This method comes from a 2006 paper by Hinton. To tackle the gradient problem, Hinton proposed an unsupervised layer-wise training method: train one layer of hidden nodes at a time, using the previous hidden layer's output as input and this layer's hidden output as the next layer's input; this process is layer-wise "pre-training". After pre-training, the whole network is "fine-tuned". Hinton used this method when training Deep Belief Networks: after pre-training each layer, BP is used to train the whole network. The idea amounts to finding local optima first and then combining them to search for the global optimum. It has some merit, but it is not used much nowadays.
- Batch normalization
The normalization operation rescales the output signal x to zero mean and unit variance, which keeps the network stable, removes the effect of scaling in w, and thereby mitigates vanishing and exploding gradients.
- Residual networks
- LSTM
- Gradient clipping
The idea is to set a clipping threshold; when updating gradients, any gradient exceeding the threshold is clipped back into that range. This prevents gradient explosion.
Another way to handle exploding gradients is weight regularization.
Training, validation, and test sets
Hold-out, leave-one-out, and k-fold cross-validation
Hold-out (holdout cross validation)
As mentioned above, the data set is statically divided into training, validation, and test sets at a fixed ratio; this is the hold-out method.
Leave-one-out (leave one out cross validation)
Each test set contains only one sample, and training/prediction is repeated m times. The training data is only one sample smaller than the whole data set, so it is closest to the original sample distribution, but the training cost grows because the number of models equals the number of samples. It is generally used when data is scarce.
k-fold cross validation
The static "hold-out" method is sensitive to how the data is split; different splits can lead to different models. "k-fold cross validation" is a dynamic validation method that reduces the impact of the data split. The steps are:
- Split the data set into a training set and a test set, and put the test set aside
- Divide the training set into k folds
- Each time, use 1 of the k folds as the validation set and everything else as the training set
- After k rounds of training, we get k different models
- Evaluate the k models and choose the best hyperparameters
- With the best hyperparameters, retrain the model on all k folds as the training set to get the final model (see the sklearn sketch below)
https://easyai.tech/ai-definition/3dataset-and-cross-validation/
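A minimal sklearn sketch of the k-fold procedure described above (the model, data set, and candidate hyperparameters are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# 1. split off a test set and put it aside
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2-5. k-fold on the training set to pick a hyperparameter
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for C in (0.1, 1.0, 10.0):
    scores = cross_val_score(LogisticRegression(C=C, max_iter=1000), X_train, y_train, cv=kf)
    print(C, scores.mean())

# 6. retrain on the whole training set with the best hyperparameter, evaluate once on the test set
best = LogisticRegression(C=1.0, max_iter=1000).fit(X_train, y_train)
print(best.score(X_test, y_test))
```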
2. NLP
Types and characteristics of common word vectors:
Types
- Bag-of-words representations based on one-hot, tf-idf, textrank, etc.;
- Topic models: LSA (SVD), pLSA, LDA;
- Fixed (static) representations based on word vectors: word2vec, fastText, glove
- Dynamic (contextual) representations based on word vectors: elmo, GPT, bert
Characteristics
- Matrix factorization (LSA): uses global corpus statistics, but SVD is computationally expensive;
- Word vectors from NNLM/RNNLM: the word vectors are a by-product, and training is not very efficient;
- word2vec, fastText: highly optimized and efficient, but based only on local context windows;
- glove: based on global corpus statistics, combining the advantages of LSA and word2vec;
- elmo, GPT, bert: dynamic (contextual) features.
2.1 W2V Glove FastText
What is the difference between word2vec and NNLM? (word2vec vs NNLM)
1) Both can essentially be regarded as language models;
2) Word vectors are only a by-product of NNLM. word2vec is essentially also a language model, but it focuses on the word vectors themselves, so it makes many optimizations to improve computational efficiency:
- Compared with NNLM, the context word vectors are summed directly rather than concatenated, and the hidden layer is dropped;
- Since softmax normalization requires traversing the entire vocabulary, hierarchical softmax and negative sampling are used as optimizations. Hierarchical softmax essentially builds a Huffman tree with minimum weighted path length, so the search path for high-frequency words becomes shorter; negative sampling is more direct: it draws negative samples for each word in every training sample.
Which of word2vec's two training schemes is better? Which is better for rare words?
In the CBOW model the input is the context (the surrounding words) and the output is the center word. Training learns the surrounding words' information, i.e. the embeddings, from the output loss, but the middle layer averages the context, and in total only V (vocab size) predictions are made.
Skip-gram uses the center word to predict the surrounding words; each prediction is a word pair, so every center word has K surrounding words as outputs and each word is predicted K times. It can therefore learn from the context more effectively, but the total number of predictions is K*V.
Skip-gram wins (and does better on rare words).
Huffman tree and Hierarchical Softmax in word2vec
https://blog.csdn.net/zynash2/article/details/81636338
https://zhuanlan.zhihu.com/p/52517050
A Huffman tree is the binary tree with the smallest weighted path length, also called the optimal binary tree. For example, with weights A:15, B:10, C:3, D:5 the Huffman tree (shown in the linked figure) has weighted path length WPL = 3*3 + 5*3 + 10*2 + 15*1 = 59.
In CBOW, the output layer is a Huffman tree built from word frequencies. Even before word2vec there were similar three-layer neural network structures for training word vectors; one big difference is that the earlier methods used a softmax output layer, and computing the softmax probabilities from the hidden layer to the output layer is very expensive. Making the output layer a Huffman tree avoids computing the softmax probability of every word: the mapping from the hidden layer to the output is done step by step along the tree structure. The benefits are reduced computation and, because the tree is built from word frequencies, high-frequency words can be reached along a short path.
Negative sampling
https://cloud.tencent.com/developer/article/1780054
CBOW based on Hierarchical Softmax has an obvious drawback: because the Huffman tree is built from word frequencies, when w is a rare word it takes a long walk down the tree to reach w's node. Negative Sampling does not use a Huffman tree; it improves training efficiency by sampling instead.
Training a network means computing on a training sample and then slightly adjusting the weights of all neurons to improve accuracy. In other words, every training sample updates all the weights of the network.
As mentioned above, when the vocabulary is very large, updating that many parameters for every sample on such a large data set is a heavy burden.
Negative sampling solves this by modifying only a small part of the network parameters for each training sample.
When the network trains on the word pair ('fox', 'quick'), the output/label is a one-hot vector: the position for 'quick' is 1 and everything else is 0.
Negative sampling randomly selects a small number of 'negative' words (say 5) whose parameters are updated. Here 'negative' means a word whose position in the output vector should be 0; of course, the weights of the 'positive' word (the correct word 'quick') are also updated.
The paper states that 5-20 negative words work well for small data sets and 2-5 for large ones.
If the output weight matrix is 300x10000 and we update 5 'negative' words plus 1 'positive' word, that is 1800 parameters, only 0.06% of the 3M parameters of the output layer.
The probability of being drawn as a negative sample is related to word frequency: the more frequent the word, the higher the probability.
The paper chose 0.75 as the exponent because it gave good experimental results. The C implementation is quite interesting: each word's index is written into a table multiple times, the number of occurrences being P(wi) * table_size. Negative sampling then only needs to generate an integer between 1 and 100M and use it to index into the table; since high-probability words occupy more slots in the table, they are more likely to be chosen. (A small sketch of this table follows.)
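A small sketch of that unigram table with the 0.75 exponent (the vocabulary, counts, and table size are toy values):

```python
import numpy as np

def build_unigram_table(word_counts, table_size=1_000_000, power=0.75):
    """Fill a table with word ids, each appearing roughly P(wi) * table_size times."""
    words = list(word_counts)
    probs = np.array([word_counts[w] ** power for w in words], dtype=np.float64)
    probs /= probs.sum()
    counts = np.round(probs * table_size).astype(int)
    return words, np.repeat(np.arange(len(words)), counts)

def sample_negatives(table, k=5):
    # drawing random indexes into the table == sampling from the smoothed unigram distribution
    return table[np.random.randint(0, len(table), size=k)]

words, table = build_unigram_table({'the': 1000, 'fox': 30, 'quick': 20}, table_size=10_000)
print([words[i] for i in sample_negatives(table, 5)])
```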
Subsampling
In the example above, the frequent word 'the' raises two issues:
- For the word pair ('fox', 'the'), 'the' contributes little to the semantic representation of 'fox'; 'the' appears in the context of nearly every word.
- There will be far more word pairs ('the', ...) than are needed to learn a good vector for 'the'.
Word2vec uses subsampling to address this: the more frequent a word is, the more likely it is to be dropped from the text. Taking a window size of 10 as an example, if we delete an occurrence of 'the':
- When the remaining words are trained, 'the' will not appear in their contexts.
- When the center word is 'the', the number of training samples is fewer than 10.
Keep probability (with the default sample threshold, where z(wi) is the word's fraction of the corpus):
P(wi) = 1.0 (100% kept) for z(wi) <= 0.0026 (only words whose frequency exceeds 0.0026 are subsampled)
P(wi) = 0.5 (50% kept) for z(wi) = 0.00746
P(wi) = 0.033 (3.3% kept) for z(wi) = 1.0 (impossible in practice)
TextCNN for text classification
https://blog.csdn.net/asialee_bird/article/details/88813385
(1) Embedding Layer
A hidden layer projects the one-hot encoded words into a low-dimensional space. It is essentially a feature extractor that encodes semantic features in the chosen dimensionality, so words with similar meanings end up close in Euclidean or cosine distance. (The authors use pre-trained word vectors obtained with fastText; vectors trained with word2vec or GloVe also work.)
(2) Convolution Layer
For image data the CNN kernels have both a width and a height, but in TextCNN the kernel width equals the word-vector dimension, because each input row represents one word and a word is the smallest granularity of text for feature extraction. The kernel height is set freely as in a CNN (typical values are 2, 3, 4, 5) and plays a role similar to an n-gram size. Since the input is a sentence and adjacent words are highly correlated, convolving this way uses not only word meaning but also word order and context (similar in spirit to the skip-gram and CBOW models).
(3) Pooling Layer
Because kernels of different heights are used in the convolution layer, the resulting vectors have different lengths, so the pooling layer applies 1-max pooling to each feature vector: the maximum of each feature vector is taken as its representative, on the assumption that the maximum captures the most important feature. After 1-max pooling over all feature vectors, the values are concatenated to form the final pooled feature vector. Dropout can be added between the pooling layer and the fully connected layer to prevent overfitting.
(4) Fully connected layer
The fully connected layers are the same as in other models. Suppose there are two fully connected layers: the first can use 'relu' as its activation and the second uses softmax to produce the probability of each class.
Practical tips:
Initializing with pre-trained word2vec or GloVe works better; one-hot is rarely used directly.
The kernel size matters a lot, usually 1~10; for longer texts choose larger kernels.
The number of kernels also matters a lot, usually 100~600, together with Dropout (0~0.5).
The usual activation functions are ReLU and tanh.
Use 1-max pooling.
As the number of feature maps increases and performance drops, try Dropout greater than 0.5.
Use cross-validation when evaluating model performance. (A minimal TextCNN sketch follows.)
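A minimal PyTorch sketch of the TextCNN described above (vocabulary size, dimensions, kernel sizes, and class count are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, num_classes=2,
                 kernel_sizes=(2, 3, 4), num_kernels=100, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)               # (1) embedding layer
        self.convs = nn.ModuleList([                                       # (2) kernels of several heights
            nn.Conv2d(1, num_kernels, (k, embed_dim)) for k in kernel_sizes
        ])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_kernels * len(kernel_sizes), num_classes)  # (4) classifier

    def forward(self, x):                        # x: (batch, seq_len) token ids
        x = self.embedding(x).unsqueeze(1)       # (batch, 1, seq_len, embed_dim)
        feats = [F.relu(conv(x)).squeeze(3) for conv in self.convs]        # per kernel height
        feats = [F.max_pool1d(f, f.size(2)).squeeze(2) for f in feats]     # (3) 1-max pooling
        return self.fc(self.dropout(torch.cat(feats, dim=1)))

logits = TextCNN()(torch.randint(0, 10000, (8, 50)))
print(logits.shape)  # torch.Size([8, 2])
```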
What are the differences between glove, word2vec, and LSA? (word2vec vs glove vs LSA)
1) glove vs LSA
- LSA (Latent Semantic Analysis) builds word vectors from the co-occurrence matrix; it is essentially SVD-based matrix factorization over the global corpus, but SVD has high computational complexity;
- glove can be regarded as an optimized, efficient matrix-factorization alternative to LSA, using Adagrad to optimize a least-squares loss;
2) word2vec vs glove
- word2vec is trained on a local corpus window: its feature extraction is based on a sliding window. glove's sliding window is used to build the co-occurrence matrix, which is based on the global corpus, so glove needs to compute the co-occurrence statistics in advance; therefore word2vec supports online learning, while glove requires statistics over a fixed corpus.
- glove's objective fits w_i^T w̃_j + b_i + b̃_j ≈ log(X_ij), where X_ij is the global co-occurrence count.
- word2vec's loss function is essentially a weighted cross-entropy with fixed weights; glove's loss function is a least-squares loss whose weights can be adjusted (via the weighting function).
- Overall, glove can be seen as word2vec with the objective function and the weighting function replaced by global-statistics-based versions.
2.2 Transformer
https://cloud.tencent.com/developer/article/1868051
2.3 BERT
https://cloud.tencent.com/developer/article/1865094
https://cloud.tencent.com/developer/article/1855316 BERT Text classification
BERT stands for Bidirectional Encoder Representations from Transformers; its core is a bidirectional Transformer encoder. The following questions and answers go through the details:
Autoregressive and autoencoding models
Mainstream language models can be divided into autoregressive (Autoregressive) and autoencoding (AutoEncoder) models. The difference is that an autoregressive model predicts a word's probability using only the left or only the right context (e.g., GPT, ELMo), while an autoencoding model can condition on both at the same time (e.g., BERT).
Why does BERT use a bidirectional Transformer encoder instead of a decoder?
The BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention in which each token can only attend to the context on its left. The bidirectional Transformer is usually called a "Transformer encoder", and the left-context-only one a "Transformer decoder", since a decoder cannot be allowed to see the information it is about to predict.
How do ELMo, GPT, and BERT differ in handling unidirectional and bidirectional language models?
Among these 3 models, only BERT jointly conditions on both the left and the right context. Isn't ELMo bidirectional? In fact, ELMo concatenates independently trained left-to-right and right-to-left LSTMs, while GPT uses a left-to-right Transformer, which is really a "Transformer decoder".
Couldn't BERT build a bidirectional language model more simply, like ELMo, by concatenating Transformer decoders?
The BERT authors argue that this kind of concatenated bi-directionality still cannot fully understand the semantics of a whole sentence. A better way is to predict the masked word using the full context in every direction, i.e., to use "can / realize / language / representation / .. / of / model" to predict the [MASK]. The BERT authors call this all-directional, context-based prediction "deep bi-directional".
Why does BERT use Masked LM instead of applying a Transformer encoder directly?
We know that the deeper the Transformer, the better it learns. So why not apply a bidirectional model directly? Because as the network gets deeper, the label leaks, as shown in the figure:
The conflict between bidirectional encoding and network depth
A deep bidirectional model is strictly more powerful than a left-to-right model or a shallow concatenation of a left-to-right and a right-to-left model. Unfortunately, a standard conditional language model can only be trained left-to-right or right-to-left, because bidirectional conditioning would let each word indirectly "see itself" through the multi-layer context.
To train a deep bidirectional representation, the team took a simple approach: randomly mask part of the input tokens and then predict only those masked tokens. The paper calls this procedure "masked LM" (MLM).
Why doesn't BERT always use the actual [MASK] token to replace the "masked" vocabulary?
As explained in "NLP required reading | Understand Google's BERT model in ten minutes": although masking does give the team a bidirectional pre-trained model, the approach has two downsides. First, pre-training and fine-tuning are mismatched, because the [MASK] token is never seen during fine-tuning. To mitigate this, the team does not always replace the "masked" words with the actual [MASK] token. Instead, the training data generator randomly selects 15% of the tokens; for example, in the sentence "my dog is hairy" it might select the token "hairy". Then, rather than always substituting [MASK], the data generator does the following:
- 80% of the time: replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]
- 10% of the time: replace the word with a random word, e.g., my dog is hairy → my dog is apple
- 10% of the time: keep the word unchanged, e.g., my dog is hairy → my dog is hairy; the purpose is to bias the representation toward the actually observed word.
The Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributed contextual representation of every input token. Moreover, because random replacement happens for only 1.5% of all tokens (i.e., 10% of 15%), it does not seem to impair the model's language understanding. The second downside of using MLM is that only 15% of the tokens are predicted in each batch, which suggests the model may need more pre-training steps to converge. The team showed that MLM converges slightly more slowly than a left-to-right model (which predicts every token), but MLM's improvements in the experiments far outweigh the extra training cost.
BERT's main innovation lies in the pre-training approach: Masked LM and Next Sentence Prediction are used to capture word-level and sentence-level representations respectively.
Here is Transformer Encoder The overall structure of the model :
Transformer Encoder
multi-head attention
tokenize
https://cloud.tencent.com/developer/article/1865689
Setting aside the multi-head part for now: what would go wrong if the word vectors in self-attention were not multiplied by the Q/K/V parameter matrices?
The core of Self-Attention is to use the other words in the text to enhance the semantic representation of the target word, so as to make better use of context information.
In self-attention, every word in the sequence computes a dot-product similarity with every word in the sequence, including itself.
For self-attention people usually say q = k = v; the equality really means they come from the same underlying vector, and in the actual computation they differ because all three are multiplied by the Q/K/V parameter matrices. Without those matrices, the q, k, v of each word would be exactly identical.
For values of the same order of magnitude, the dot product of q_i with k_i would then be the largest (by analogy with "for two numbers with a fixed sum, the product is largest when they are equal").
Then, in the weighted average after the softmax, the word itself would take by far the largest share, leaving very little weight for the other words, and context information could not be used to enhance the semantic representation of the current word.
Multiplying by the Q/K/V parameter matrices makes each word's q, k, v different, which greatly reduces this effect.
Of course, the Q/K/V parameter matrices also make multi-head attention possible, similar to multiple kernels in a CNN, allowing richer features/information to be captured. (A minimal single-head sketch follows.)
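A single-head sketch of scaled dot-product self-attention with learned Q/K/V projections (all dimensions and matrices here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (n, d) token vectors. Without Wq/Wk/Wv we would have Q = K = V = X,
    # and after softmax each token would mostly attend to itself.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) similarities, the O(n^2 * d) step
    weights = softmax(scores, axis=-1)   # each row is an attention distribution
    return weights @ V                   # context-enhanced representations

rng = np.random.default_rng(0)
n, d, d_k = 5, 16, 8
X = rng.normal(size=(n, d))
out = self_attention(X, rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k)))
print(out.shape)  # (5, 8)
```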
Why? BERT choice mask fall 15% The proportion of this word , Can it be any other ratio ?
BERT Adopted Masked LM, Will select all the words in the corpus 15% Do random mask, The paper is inspired by cloze task , But in fact And CBOW There's also the magic of the same thing .
from CBOW The angle of , here
A better explanation is : In a size of
Select a word randomly from the window of , similar CBOW The central word of the sliding window in , The difference is that the sliding window here is non overlapping .
Then from CBOW Sliding window angle of ,10%~20% All of them are still ok The proportion of .
The above unofficial explanation , It's from a perspective of understanding provided by a friend of mine , For reference .
Why can a pre-trained BERT model take at most 512 tokens as input and pack at most two sentences into one sequence?
This comes from Google's initial configuration of the pre-trained BERT models: the former corresponds to Position Embeddings and the latter to Segment Embeddings.
In BERT, the Token, Position, and Segment Embeddings are all learned; in the pytorch code they look like this:
self.word_embeddings = Embedding(config.vocab_size, config.hidden_size)
self.position_embeddings = Embedding(config.max_position_embeddings, config.hidden_size)
self.token_type_embeddings = Embedding(config.type_vocab_size, config.hidden_size)
The BERT pytorch code above is from https://github.com/xieyufei1993/Bert-Pytorch-Chinese-TextClassification, whose structure is very clear.
And in the BERT config:
"max_position_embeddings": 512
"type_vocab_size": 2
Therefore, when directly using Google's pre-trained BERT models, the input can contain at most 512 tokens (and [CLS] and [SEP] still have to come out of that budget) and at most two sentences packed into one sequence; beyond that, there are simply no position or segment embeddings for the extra words and sentences.
Of course, if you have enough hardware resources to retrain BERT yourself, you can change the BERT config and set larger max_position_embeddings and type_vocab_size values to fit your needs.
Why does BERT prepend a [CLS] token to the first sentence?
BERT adds a [CLS] token before the first sentence; the vector at this position in the last layer can serve as the semantic representation of the whole sequence and can therefore be used for downstream classification tasks.
Why choose it? Compared with the other words already present in the text, this symbol has no obvious semantic information of its own, so it can aggregate the semantic information of the words in the text more "fairly" and thus represent the whole sentence better.
Concretely, self-attention uses the other words in the text to enhance the semantic representation of the target word, but the target word's own semantics still take the largest share. So after BERT's 12 layers, each word's embedding is a mixture of all the words and represents its meaning in context. The [CLS] position has no semantics of its own, so after 12 layers what it holds is essentially an attention-weighted average over all the words, which represents sentence-level semantics better than any ordinary word.
Of course, one can also pool over all the last-layer word embeddings to represent the sentence.
A note on BERT's outputs: there are two kinds, corresponding in the BERT TF source code to:
get_pooled_output(), which gives the [CLS] representation, with output shape [batch_size, hidden_size];
get_sequence_output(), which gives the vector representation of every token in the whole sequence, with output shape [batch_size, seq_length, hidden_size]; it also includes [CLS], so take care when doing token-level tasks.
How is the time complexity of Self-Attention computed?
The time complexity of Self-Attention is O(n² · d), where n is the sequence length and d is the embedding dimension.
Self-Attention involves three steps: similarity computation, softmax, and the weighted average. Their time complexities are:
The similarity computation can be seen as multiplying an (n, d) matrix by a (d, n) matrix: O(n² · d), giving an (n, n) matrix.
The softmax is computed directly on that matrix; its time complexity is O(n²).
The weighted average can be seen as multiplying an (n, n) matrix by an (n, d) matrix: O(n² · d), giving an (n, d) matrix.
Therefore, the time complexity of Self-Attention is O(n² · d).
Now consider Multi-Head Attention, whose role is similar to multiple kernels in a CNN.
Multiple heads are not implemented by looping over each head; they are done with transposes and reshapes plus matrix multiplications.
"In practice, the multi-headed attention are done with transposes and reshapes rather than actual separate tensors." — from the Google BERT source code
Transformer/BERT splits d, i.e., the hidden_size/embedding_size dimension, with a reshape; see Google's TF source code or the pytorch source mentioned above:
hidden_size (d) = num_attention_heads (m) * attention_head_size (a), i.e., d = m * a
The num_attention_heads dimension is then transposed to the front, so Q and K both have shape (m, n, a) (ignoring the batch dimension here).
The dot product can then be seen as multiplying an (m, n, a) tensor by an (m, a, n) tensor, giving an (m, n, n) tensor; this is equivalent to multiplying an (n, a) matrix by an (a, n) matrix m times, so the time complexity (thanks to a reader for the correction) is O(n² · m · a) = O(n² · d).
For the time-complexity analysis of tensor multiplication, see: time-complexity analysis of matrix and tensor multiplication.
So the time complexity of Multi-Head Attention is also O(n² · d); compared with a single head the complexity does not change. It is mainly transposes and reshapes, which amounts to splitting one large matrix multiplication into several smaller ones.
Where does the Transformer share weights, and why can weights be shared there?
The Transformer shares weights in two places:
(1) between the Embedding layers of the Encoder and the Decoder;
(2) between the Embedding layer and the FC (pre-softmax) layer in the Decoder.
For (1): in "Attention is all you need" the Transformer is used for machine translation, where the source and target languages are different but can share one large vocabulary. Symbols that appear in both languages (numbers, punctuation, etc.) then get better representations, and for the Encoder and the Decoder only the embeddings of the corresponding language are activated during lookup, so sharing one vocabulary and its embedding weights is possible.
In the paper the vocabulary is built with BPE, so the smallest unit is the subword. English and German belong to the same Germanic family and share many subwords with similar semantics; for languages as different as Chinese and English, shared semantics may not help much.
However, sharing one vocabulary increases its size and therefore the softmax computation time, so whether to share in practice has to be weighed case by case.
This point refers to: https://www.zhihu.com/question/333419099/answer/743341017
For (2): the Embedding layer maps a one-hot vector to its embedding vector, and the FC layer does roughly the reverse: from a vector (call it x) it produces the softmax probability of each word, and the word with the largest probability (under greedy decoding) is the prediction.
Which word gets the largest probability? Assuming each row of the FC layer has the same magnitude, in theory the row most similar to x has the largest dot product with it and hence the largest softmax probability (compare question 1 in this article).
So when the Embedding layer and the FC layer share weights, the word whose embedding row is closest to the vector x receives the largest predicted probability. In fact, the Decoder's Embedding layer and FC layer are more or less inverse processes of each other.
Such weight sharing reduces the number of parameters and speeds up convergence.
At first I was puzzled: the Embedding layer's parameter has shape (v, d) while the FC layer's parameter has shape (d, v); can they be shared directly, or is a transpose needed? Here v is the vocabulary size and d is the embedding dimension.
Looking at the pytorch source shows they can be shared directly:
fc = nn.Linear(d, v, bias=False)  # Decoder FC layer definition
weight = Parameter(torch.Tensor(out_features, in_features))  # Linear layer weight definition
In the Linear layer definition the weight is stored as (out_features, in_features), and the actual computation first transposes the weight and then multiplies it with the input, so the weight of the FC Linear layer also has shape (v, d) and can be shared directly.
Where does BERT's nonlinearity come from?
From the gelu activation in the feed-forward layers and from self-attention; self-attention itself is nonlinear (thanks to a reader for pointing this out).
Does directly adding BERT's three Embeddings affect the semantics?
This is a very interesting question; Su Jianlin gave an answer that is really elegant:
The mathematical essence of an Embedding is a single fully connected layer applied to a one-hot input. In other words, there is no such thing as an Embedding; there is only one-hot.
Let me try to explain it again with an example:
Suppose the token Embedding matrix has shape [4, 768], the position Embedding matrix has shape [3, 768], and the segment Embedding matrix has shape [2, 768].
For some word, suppose its token one-hot is [1, 0, 0, 0], its position one-hot is [1, 0, 0], and its segment one-hot is [1, 0].
The word's final word Embedding is the sum of these three Embeddings.
That summed word Embedding is exactly what you get by concatenating the three one-hot features into [1, 0, 0, 0, 1, 0, 0, 1, 0] and passing it through a fully connected layer whose weight matrix has shape [4+3+2, 768] = [9, 768]; the resulting vector is the same.
Another way to understand it:
Concatenating the three one-hot features into [1, 0, 0, 0, 1, 0, 0, 1, 0] is no longer one-hot, but it can be mapped into the feature space formed by the three one-hot spaces, whose dimensionality is 4*3*2 = 24; in that new feature space the word's one-hot is [1, 0, 0, 0, 0, ...] (23 zeros).
In that case the Embedding matrix would have shape [24, 768], and the final word Embedding would still be equivalent to the one above, but the three small Embedding matrices together are much smaller than the Embedding matrix over the new feature space.
Of course, under the same initialization scheme the two ways of getting the word Embedding may have different variances; however, BERT also applies Layer Norm, which brings the Embedding outputs to the same distribution.
Adding BERT's three Embeddings can essentially be seen as feature fusion; a model as strong as BERT should be able to learn good semantic information from the fused features. (A tiny numpy check of the equivalence follows.)
Reference: https://www.zhihu.com/question/374835153
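A tiny numpy check of the equivalence described above: summing the three Embedding lookups equals one fully connected layer applied to the concatenated one-hots (all matrices are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
tok_E = rng.normal(size=(4, 768))   # token embedding matrix
pos_E = rng.normal(size=(3, 768))   # position embedding matrix
seg_E = rng.normal(size=(2, 768))   # segment embedding matrix

tok_oh, pos_oh, seg_oh = np.eye(4)[0], np.eye(3)[0], np.eye(2)[0]

summed = tok_oh @ tok_E + pos_oh @ pos_E + seg_oh @ seg_E   # sum of the three embeddings

concat_oh = np.concatenate([tok_oh, pos_oh, seg_oh])        # [1,0,0,0,1,0,0,1,0]
stacked_E = np.vstack([tok_E, pos_E, seg_E])                # shape (9, 768)
fused = concat_oh @ stacked_E                               # one FC layer on the concatenation

print(np.allclose(summed, fused))  # True
```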
The following two questions are also very good and worth special attention; there are already good answers online:
Why does the Transformer scale the attention scores (divide the dot products by sqrt(d_k))?
Reference: https://www.zhihu.com/question/339723385
How do you handle long texts when applying BERT?
Reference: https://www.zhihu.com/question/3274
2.4 NLU tasks
https://easyai.tech/ai-definition/nlu/
https://zhuanlan.zhihu.com/p/143221527
2.5 NLG tasks
https://zhuanlan.zhihu.com/p/375142707
3. leetcode
Counting inversions
Let A[1..n] be an array of n distinct numbers. If i < j and A[i] > A[j], the pair (i, j) is called an inversion of A. Give an algorithm that, with O(n log n) worst-case running time, determines the number of inversions in any permutation of n elements.
Merge sort (count inversions during the merge step; a sketch follows).
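A merge-sort-based sketch for counting inversions in O(n log n):

```python
def count_inversions(a):
    def sort(arr):
        if len(arr) <= 1:
            return arr, 0
        mid = len(arr) // 2
        left, inv_l = sort(arr[:mid])
        right, inv_r = sort(arr[mid:])
        merged, i, j, inv = [], 0, 0, inv_l + inv_r
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
                inv += len(left) - i   # every element still in `left` forms an inversion
        merged += left[i:] + right[j:]
        return merged, inv
    return sort(list(a))[1]

print(count_inversions([2, 4, 1, 3, 5]))  # 3
```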
Reversing a linked list
https://cloud.tencent.com/developer/article/1835266
Iterative and recursive solutions
Find all subsets of a set
Depth-first search (DFS) / backtracking; a sketch follows.
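A DFS/backtracking sketch for enumerating all subsets:

```python
def subsets(nums):
    res, path = [], []
    def dfs(start):
        res.append(path[:])          # record the current subset
        for i in range(start, len(nums)):
            path.append(nums[i])     # choose nums[i]
            dfs(i + 1)               # explore further
            path.pop()               # backtrack
    dfs(0)
    return res

print(subsets([1, 2, 3]))
# [[], [1], [1, 2], [1, 2, 3], [1, 3], [2], [2, 3], [3]]
```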
Reservoir sampling
https://www.jianshu.com/p/7a9ea6ece2af
Given a data stream whose length N is very large and unknown until all the data has been processed, how can we randomly select m non-repeating items while traversing the data only once (O(N))?
The algorithm works as follows:
- If fewer than m items have been received, put each of them into the reservoir.
- When receiving the i-th item with i >= m, draw a random integer d in [0, i]; if d falls in [0, m-1], replace the d-th item in the reservoir with the i-th item.
- Repeat step 2.
The beauty of the algorithm: when all the data has been processed, every item is in the reservoir with probability m/N. (A sketch follows.)
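A sketch of single-machine reservoir sampling following the steps above:

```python
import random

def reservoir_sample(stream, m):
    reservoir = []
    for i, item in enumerate(stream):
        if i < m:
            reservoir.append(item)      # the first m items go straight into the reservoir
        else:
            d = random.randint(0, i)    # random integer in [0, i]
            if d < m:
                reservoir[d] = item     # replace, so item i is kept with probability m/(i+1)
    return reservoir

print(reservoir_sample(range(100000), 5))
```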
Distributed reservoir sampling
For truly huge data, even an O(N) reservoir-sampling pass takes a long time on a single machine, so a distributed version is used. It works as follows:
- Suppose there are K machines. Split the big data set into K streams; each machine runs single-machine reservoir sampling on its own stream to sample m items, and records the amount of data it processed as N1, N2, ..., Nk, ..., NK (assume m < Nk), with N1 + N2 + ... + NK = N.
- Draw a random number d in [1, N]. If d < N1, pick one item uniformly at random without replacement (probability 1/m) from the first machine's reservoir; if N1 <= d < (N1 + N2), pick one item from the second machine's reservoir; and so on. Repeat this m times to finally obtain m items sampled from the N-item data set.
Verification that each item is selected with probability m/N:
The probability that any particular item of the k-th machine's data ends up in that machine's reservoir is m/Nk.
The probability that one draw selects the k-th machine's reservoir is Nk/N.
The probability that a particular item of the k-th machine's reservoir is picked in that draw is 1/m (uniform selection without replacement).
After m draws, the probability that each item is selected is m * (m/Nk * Nk/N * 1/m) = m/N.
Find the closest palindrome number
https://github.com/Shellbye/Shellbye.github.io/issues/71
Longest palindromic substring
https://cloud.tencent.com/developer/article/1835232
Word Break
https://cloud.tencent.com/developer/article/1835271
Generate parentheses
Given n pairs of parentheses, write a function to generate all combinations of well-formed parentheses.
For example, given n = 3, a solution set is: "((()))", "(()())", "(())()", "()(())", "()()()"
Idea: if there are left parentheses remaining, a left parenthesis can be placed; if the number of remaining right parentheses is greater than the remaining left parentheses, a right parenthesis can be placed. (A backtracking sketch follows.)
https://cloud.tencent.com/developer/article/1835234
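A backtracking sketch following that rule:

```python
def generate_parenthesis(n):
    res = []
    def backtrack(s, left, right):
        # left / right: how many '(' / ')' remain to be placed
        if left == 0 and right == 0:
            res.append(s)
            return
        if left > 0:
            backtrack(s + '(', left - 1, right)   # a '(' can be placed while any remain
        if right > left:
            backtrack(s + ')', left, right - 1)   # a ')' only if more ')' than '(' remain
    backtrack('', n, n)
    return res

print(generate_parenthesis(3))
# ['((()))', '(()())', '(())()', '()(())', '()()()']
```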
Coin change
https://cloud.tencent.com/developer/article/1835233
Recurrence for the minimum number of coins (dp[x] = minimum coins needed for amount x):
dp[c_j] = 1 for every coin value c_j
dp[i] = dp[i - c_j] + 1 (when dp[i - c_j] is reachable)
-> in general, for each coin c: dp[x + c] = min(dp[x] + 1, dp[x + c]); a sketch follows.
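A sketch of the minimum-coin DP described by the recurrence above:

```python
def coin_change(coins, amount):
    INF = float('inf')
    dp = [0] + [INF] * amount              # dp[x] = minimum number of coins to make amount x
    for x in range(1, amount + 1):
        for c in coins:
            if x >= c and dp[x - c] + 1 < dp[x]:
                dp[x] = dp[x - c] + 1      # dp[x] = min(dp[x], dp[x - c] + 1)
    return dp[amount] if dp[amount] != INF else -1

print(coin_change([1, 2, 5], 11))  # 3  (5 + 5 + 1)
```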
4. Big data
4.1 De-duplicating massive text data
https://www.jianshu.com/p/21d743ddc397
Advantages of deduplicating in Spark with the "sort within partitions, then drop duplicates" operators:
a. sortWithinPartitions returns a DataFrame that is sorted within each partition. We only need to follow the partition-sorted DataFrame with dropDuplicates on the same columns to get back a DataFrame that keeps the latest record for each key.
b. This deduplication approach greatly improves performance on massive data and reduces shuffle operations in the job. (A PySpark sketch follows.)
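A hedged PySpark sketch of the idea (the column names and data are placeholders, and whether an extra repartition is needed depends on how the data arrives; see the linked article for the full approach):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedup").getOrCreate()

df = spark.createDataFrame(
    [("doc1", "2022-01-01", "v1"), ("doc1", "2022-03-01", "v2"), ("doc2", "2022-02-01", "v1")],
    ["doc_id", "update_time", "content"],
)

# sort within each partition by key and time (newest first), then drop duplicates on the key,
# so the row kept for each doc_id is its latest record (this follows the linked article's approach)
latest = (
    df.repartition("doc_id")
      .sortWithinPartitions("doc_id", F.col("update_time").desc())
      .dropDuplicates(["doc_id"])
)
latest.show()
```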
5. Ref
5.1 Interview experience collections
- https://github.com/amusi/Deep-Learning-Interview-Book
- https://nowcoder.net/discuss/693666?channel=-1&source_id=discuss_terminal_discuss_sim_nctrack&ncTraceId=33c4da6bfe26440ba4a999b368dd71bd.163.16281287115543777
- https://www.nowcoder.com/discuss/587631?from=gitnowcoder2021
5.2 References
https://zhuanlan.zhihu.com/p/115014536 — a survey of pre-trained models