Beam search and five optimization methods
2022-06-25 08:24:00 【Happy little yard farmer】
1. Review Beam Search
Reference: Andrew Ng's Deep Learning notes, Course 5 Week 3, Sequence Models
Review of beam search:
Beam search is an improvement over greedy search: it expands the search space, making it easier to approach the globally optimal solution. Beam search has one parameter, the beam width $B$, meaning that the $B$ highest-scoring sequences are kept at each time step, and these $B$ candidate sequences are then used to continue generation at the next time step.
In sequence models, the input text $\{x^{<1>},\dots,x^{<T_x>}\}$ is fed into the Encoder and then passed through the Decoder; the softmax layer outputs a probability value for each of the $|V|$ words in the vocabulary.
The steps of Beam Search:
- Compute the probability of the first output, $P(y^{<1>}|x)$, and keep the $B$ words with the largest probability.
- For each of those $B$ cases, compute the Decoder output $P(y^{<2>}|x,y^{<1>})$, giving $B\cdot|V|$ results. Maximize the probability of the sentence so far, $P(y^{<1>},y^{<2>}|x)=P(y^{<1>}|x)P(y^{<2>}|x,y^{<1>})$, i.e. pick the first-word/second-word pairs with the highest joint probability, and keep the top $B$ results.
- At each step, $B$ copies of the network evaluate the probability of the next word. Continue expanding and selecting until the end-of-sentence token <EOS> is produced.
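To make these steps concrete, here is a minimal beam search sketch in Python. The `next_token_log_probs(prefix)` decoder interface and the token names are assumptions made only for illustration; this is a sketch, not a production implementation. With `beam_width=1` it degenerates into greedy search.

```python
def beam_search(next_token_log_probs, vocab, beam_width=3, max_len=20, eos="<EOS>"):
    """Minimal beam search sketch.

    next_token_log_probs(prefix) -> {token: log P(token | x, prefix)}
    is an assumed decoder interface used only for illustration.
    """
    beams = [(0.0, [])]          # (accumulated log probability, token list)
    finished = []

    for _ in range(max_len):
        candidates = []
        for logp, prefix in beams:
            # Expand each of the B kept hypotheses with all |V| tokens: B * |V| candidates.
            step_log_probs = next_token_log_probs(prefix)
            for token in vocab:
                candidates.append((logp + step_log_probs[token], prefix + [token]))
        # Keep only the top-B candidates by total log probability.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, seq in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((logp, seq))   # hypothesis ended with <EOS>
            else:
                beams.append((logp, seq))
        if not beams:                          # every kept hypothesis has finished
            break

    finished.extend(beams)                     # include hypotheses cut off at max_len
    return max(finished, key=lambda c: c[0])
```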
The Beam Search refinement covered in Andrew Ng's Coursera deep learning course corresponds to the first part of the Normalization section below.
Here I summarize several types of Beam Search optimization methods:
- Hypothesis filtering: filter the generated candidate sequences.
- Normalization: improvements to the Beam Search scoring function.
- Decoding with auxiliary language model: use an auxiliary language model to improve the decoder.
- Decoding with lexical constraints: add constraints on words or phrases; the decoder must respect the constrained vocabulary during generation.
- Search algorithms: optimize the search strategy and look for more efficient algorithms.
2. Beam Search Refinement
2.1 Hypothesis filtering
The hypotheses generated by beam search can be filtered by controlling the number of unknown-word tokens (<UNK>): when a hypothesis contains too many of them, it can be dropped. Note that dropping hypotheses temporarily reduces the beam size.
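A minimal sketch of such a filter, assuming each hypothesis is a list of tokens and unknown words are marked with a <UNK> token (both assumptions for illustration):

```python
def filter_hypotheses(hypotheses, max_unk=2, unk_token="<UNK>"):
    """Drop hypotheses that contain too many unknown-word tokens.

    hypotheses: list of token lists produced by beam search.
    Dropping hypotheses temporarily shrinks the effective beam size.
    """
    return [h for h in hypotheses if h.count(unk_token) <= max_unk]
```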
2.2 Normalization
This mainly improves the beam search scoring function, including adding penalties based on the length of the output sentence and on coverage.
2.2.1 Length Normalization
Reference: Andrew Ng's Deep Learning notes, Course 5 Week 3, Sequence Models
Beam Search tries to maximize the probability $P(y^{<1>},\dots,y^{<T_y>}|X)$, i.e. the probability of the whole sentence, which can be written as a product of conditional probabilities:
$$P(y^{<1>},\dots,y^{<T_y>}|X)=P(y^{<1>}|X)\,P(y^{<2>}|X,y^{<1>})\,P(y^{<3>}|X,y^{<1>},y^{<2>})\cdots P(y^{<T_y>}|X,y^{<1>},y^{<2>},\dots,y^{<T_y-1>})$$
Because every conditional probability is less than 1, multiplying many values that are less than 1 drives the result toward zero (numerical underflow). The value becomes too small for the computer's floating-point representation to store accurately, so in practice we cannot maximize this product directly; instead we take its $\log$.
Maximizing $\log P(y|x)$, a sum of log probabilities, is numerically more stable and less prone to rounding errors or numerical underflow. Because the logarithm is a strictly monotonically increasing function, maximizing $\log P(y|x)$ is equivalent to maximizing $P(y|x)$. In practice we always keep the sum of the logs of the probabilities, not the product of the probabilities.
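A toy illustration (not from the original course notes) of why the sum of logs is preferred:

```python
import math

# 200 token probabilities, each fairly small: the raw product underflows to 0.0.
probs = [0.01] * 200

product = 1.0
for p in probs:
    product *= p
print(product)                              # 0.0 -> numerical underflow

# The sum of log probabilities stays comfortably representable.
log_sum = sum(math.log(p) for p in probs)
print(log_sum)                              # about -921.03, no underflow
```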
This leads to the following improvements to the beam search objective function:
- The original beam search objective function:
$$\arg\max_{y}\prod_{t=1}^{T_y}P(y^{<t>}|X,y^{<1>},\dots,y^{<t-1>})$$
If a sentence is long, its probability will be very low, because multiplying many numbers less than 1 yields a very small value. This objective function therefore favors short outputs, since short sentences multiply fewer sub-1 factors and retain a higher probability.
- The improved beam search objective function (sum of log probabilities):
$$\arg\max_{y}\sum_{t=1}^{T_y}\log P(y^{<t>}|X,y^{<1>},\dots,y^{<t-1>})$$
Every probability $P(y|x)\in(0,1]$, so $\log P(y|x)$ is negative (or zero). If the sentence is long, many terms are summed and the result becomes more negative. So this objective function also penalizes longer outputs, because long sentences always accumulate a very small objective value.
- Improved beam search: introduce length normalization to reduce the penalty on long outputs.
$$\arg\max_{y}\frac{1}{T_y^{\alpha}}\sum_{t=1}^{T_y}\log P(y^{<t>}|X,y^{<1>},\dots,y^{<t-1>})$$
$\alpha\in[0,1]$ is a hyperparameter, commonly set to $\alpha=0.7$. The value is heuristic; you can try different values and see which gives the best results.
- $\alpha=0$: no normalization.
- $\alpha=1$: standard (full) length normalization.
At the end of Beam Search, the formula above is used to pick the best sentence.
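A sketch of this final selection step, assuming each finished hypothesis keeps its list of per-token log probabilities (an assumed data layout for illustration):

```python
def normalized_score(token_log_probs, alpha=0.7):
    """Length-normalized objective: (1 / T_y^alpha) * sum_t log P(y<t> | ...)."""
    t_y = len(token_log_probs)
    return sum(token_log_probs) / (t_y ** alpha)

def pick_best(finished_hypotheses, alpha=0.7):
    """finished_hypotheses: list of (tokens, per-token log probabilities) pairs."""
    return max(finished_hypotheses, key=lambda h: normalized_score(h[1], alpha))
```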
How should $B$ be chosen?
- A larger $B$ (10-100): more candidates are considered and better results are obtained, but computation grows and decoding slows down.
- A smaller $B$ (1-10): results may not be as good, but decoding is fast.
- $B=10$ is a common choice.
Unlike DFS/BFS, Beam Search is an approximate (heuristic) search algorithm: it does not guarantee the global optimum but settles for a local optimum.
In a real project, how do you tell whether the RNN model or Beam Search is at fault?
Increasing $B$ generally does not hurt the model, much like adding more data.
Error analysis for Beam Search:
In a translation task, let $y^*$ denote the human translation and $\hat{y}$ the algorithm's output.
Use the RNN model to compute $P(y^*|x)$ and $P(\hat{y}|x)$, i.e. compare the probability of the desired output with that of the produced output, to determine whether the RNN or Beam Search is the problem.
Case 1: $P(y^*|x) > P(\hat{y}|x)$ — Beam Search is at fault.
The desired output has the higher probability, but Beam Search failed to select it, so the problem lies in Beam Search; it can be mitigated by increasing $B$.
Case 2: $P(y^*|x) \le P(\hat{y}|x)$ — the RNN model is at fault, and more time should be spent on the RNN model.
The RNN model assigns the wrong probability $P(y^*|x)$: it gives a good sentence a small probability. This can be improved with regularization, more training data, different network architectures, and other methods.
If length normalization is used, the comparison should be made with the normalized values.
The error analysis procedure is as follows: first traverse the development set and find the errors made by the algorithm. For example, if $P(y^*|x)=2\times 10^{-10}$ and $P(\hat{y}|x)=1\times 10^{-10}$, this is Case 1 and Beam Search is at fault. Label the error type of each example in turn; this analysis tells you what fraction of errors are caused by Beam Search versus the RNN model. If Beam Search causes most of the errors, increase the beam width. Conversely, if the RNN model causes more errors, do a deeper analysis: add regularization, collect more training data, try a different network architecture, or other solutions.
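A sketch of this bookkeeping, where `model_log_prob` is an assumed scoring interface used only for illustration:

```python
def attribute_errors(dev_examples, model_log_prob):
    """Blame assignment over the development set (sketch).

    dev_examples: list of (x, y_star, y_hat) tuples where y_hat != y_star.
    model_log_prob(x, y): log P(y | x) under the RNN model; use the
    length-normalized score here if length normalization is in use.
    """
    counts = {"beam_search": 0, "rnn_model": 0}
    for x, y_star, y_hat in dev_examples:
        if model_log_prob(x, y_star) > model_log_prob(x, y_hat):
            counts["beam_search"] += 1   # Case 1: model prefers y*, search missed it
        else:
            counts["rnn_model"] += 1     # Case 2: model itself scores y* too low
    return counts
```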
2.2.2 Coverage Normalization
Paper 1: Coverage-based Neural Machine Translation
Paper 2: Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
The Beam Search objective function discussed above can be called the Beam Search scoring function (score). With the log likelihood below as the score function, Beam Search tends to prefer shorter sequences; introducing length normalization addresses this problem.
$$score(x,y)=\sum_{t=1}^{T_y}\log P(y_t|x,y_{<t})$$
Paper 1, from Huawei's Noah's Ark Lab in 2016, points out that machine translation suffers from over-translation or under-translation because of attention-coverage problems. The authors propose a coverage-based attention mechanism to address the coverage problem.
Section 7 (Decoder) of paper 2, Google's Neural Machine Translation System, proposes length normalization and a coverage penalty. The coverage penalty is mainly used with Attention: it pushes the Decoder to spread its attention evenly over every token of the input sequence $x$, preventing some tokens from absorbing too much attention while others are ignored.
Paper 2 combines the log likelihood, length normalization, and coverage penalty into a new beam search scoring function:
$$s(Y,X)=\frac{\log P(Y|X)}{lp(Y)}+cp(X;Y),\qquad lp(Y)=\frac{(5+|Y|)^{\alpha}}{(5+1)^{\alpha}},\qquad cp(X;Y)=\beta\sum_{i=1}^{|X|}\log\Big(\min\Big(\sum_{j=1}^{|Y|}p_{i,j},\,1.0\Big)\Big)$$
where $X$ is the source, $Y$ is the current target, $lp$ is the length normalization, $cp$ is the coverage penalty, and $p_{i,j}$ is the attention that the $j$-th target token pays to the $i$-th input token. The hyperparameters $\alpha$ and $\beta$ control length normalization and the coverage penalty respectively. With $\alpha=0$ and $\beta=0$, the Decoder falls back to plain Beam Search.
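A sketch of this scoring function, following the lp/cp formulation in paper 2; the attention-matrix layout and the default hyperparameter values are assumptions for illustration:

```python
import math

def gnmt_score(sum_log_prob, target_len, attention, alpha=0.6, beta=0.2):
    """Beam search scoring with length normalization and coverage penalty (sketch).

    sum_log_prob: log P(Y|X), accumulated log probability of the hypothesis.
    target_len:   |Y|, current target length.
    attention:    attention[i][j] = p_ij, attention that target step j pays to
                  source token i (row-per-source-token layout is an assumption).
    alpha, beta:  hyperparameters controlling lp and cp.
    """
    # Length normalization: lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha
    lp = ((5.0 + target_len) ** alpha) / ((5.0 + 1.0) ** alpha)
    # Coverage penalty: cp(X;Y) = beta * sum_i log(min(sum_j p_ij, 1.0));
    # the max() guards against log(0) for source tokens with no attention yet.
    cp = beta * sum(
        math.log(min(max(sum(row), 1e-10), 1.0)) for row in attention
    )
    return sum_log_prob / lp + cp
```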
2.2.3 End of sentence Normalization
Reference: Beam search
Sometimes the Decoder has trouble stopping during generation. In that case the maximum generation length needs to be controlled, and the following term can be used as a constraint:
$$ep(X,Y)=\gamma\frac{|X|}{|Y|}$$
where $|X|$ is the length of the source, $|Y|$ is the length of the current target, and $\gamma$ is the end-of-sentence normalization coefficient. From the formula, the longer the target grows, the smaller this score becomes, which keeps the generated sequence from going on without stopping.
There are many more improvements to the score function. For example, the paper "A Diversity-Promoting Objective Function for Neural Conversation Models" uses Maximum Mutual Information (MMI) as the objective function, which can also serve as the Beam Search score function.
2.3 Decoding with auxiliary language model
Beam search can also use an auxiliary language model to define the score function, for example "Shallow Fusion":
$$s(Y,X)=s_{TM}(Y,X)+\beta \cdot s_{LM}(Y)$$
where $s_{LM}(Y)$ is the log probability of the output sequence $Y$ under the language model $LM$, and $\beta$ is a hyperparameter. Note that the language model $LM$ cannot be a bidirectional RNN and needs to share the same vocabulary (tokens and features) as the translation model $TM$.
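A sketch of the Shallow Fusion score; the two log-probability callables and the value of $\beta$ are assumptions for illustration:

```python
def shallow_fusion_score(y_tokens, x_tokens, tm_log_prob, lm_log_prob, beta=0.3):
    """Shallow Fusion score: s(Y, X) = s_TM(Y, X) + beta * s_LM(Y).

    tm_log_prob(y, x): log probability of Y given X under the translation model.
    lm_log_prob(y):    log probability of Y under the auxiliary language model,
                       which must share the TM vocabulary and be unidirectional.
    """
    return tm_log_prob(y_tokens, x_tokens) + beta * lm_log_prob(y_tokens)
```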
2.4 Decoding with lexical constraints
Constraints are placed on the vocabulary: after finding the corresponding tokens in the input source, beam search restricts the words or phrases that must appear in the generation. The Grid Beam Search method described in the paper Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search accepts simple one-token constraints.
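Grid Beam Search itself is more involved; a much simpler illustration of the underlying idea, refusing to end a hypothesis until the required tokens have appeared, might look like this (names are assumptions for illustration):

```python
def can_emit_eos(hypothesis_tokens, required_tokens):
    """Allow a hypothesis to finish only once every required token has appeared.

    This is only a crude illustration of the idea behind lexical constraints;
    Grid Beam Search organizes the beam by the number of constraints already
    satisfied and is considerably more general than this check.
    """
    return all(tok in hypothesis_tokens for tok in required_tokens)
```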
2.5 Search algorithms
The paper "Best-First Beam Search" focuses on improving the search efficiency of Beam Search and proposes the Best-First Beam Search algorithm, which obtains the same search results as standard Beam Search in less time. The paper notes that Beam Search runtime is proportional to the number of calls to the scoring function, so the authors aim to reduce the number of scoring-function calls and thereby improve Beam Search efficiency.
Best-First Beam Search uses a priority queue and defines a new comparison operator, which reduces the number of scoring-function calls and allows the search to stop earlier.
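A rough sketch of the priority-queue idea, reusing the hypothetical `next_token_log_probs` interface from Section 1; it omits the beam-width constraint and other details of the paper and is only meant to show why fewer scoring-function calls are possible:

```python
import heapq

def best_first_search(next_token_log_probs, vocab, max_len=20, eos="<EOS>"):
    """Priority-queue decoding sketch (not a faithful reimplementation of the paper).

    Assumes scores only decrease as hypotheses grow (log probabilities are <= 0),
    so the first completed hypothesis popped cannot be beaten by anything left
    in the queue.
    """
    heap = [(0.0, [])]                            # max-heap via negated log probs
    while heap:
        neg_logp, prefix = heapq.heappop(heap)    # best partial hypothesis first
        if prefix and prefix[-1] == eos:
            return -neg_logp, prefix              # stop early: best completed hypothesis
        if len(prefix) >= max_len:
            continue
        step_log_probs = next_token_log_probs(prefix)   # one scoring-function call
        for token in vocab:
            heapq.heappush(heap, (neg_logp - step_log_probs[token], prefix + [token]))
    return None
```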
3. References
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
【RNN】An in-depth analysis of beam search for improving prediction accuracy
Coursera Deep Learning course notes (Chinese)
Andrew Ng's Deep Learning notes, Course 5 Week 3, Sequence Models
A detailed introduction to Beam Search and its optimization methods
Modeling Coverage for Neural Machine Translation
Coverage-based Neural Machine Translation
A Diversity-Promoting Objective Function for Neural Conversation Models
A Diversity-Promoting Objective Function for Neural Conversation Models (paper reading notes)