
Sampling and decoding strategies for seq2seq text generation

2022-06-25 08:29:00 Happy little yard farmer

What are the decoding and sampling strategies for seq2seq-based text generation?

Seq2Seq-based text generation models can use different decoding strategies. Decoding strategies in text generation fall into two categories:

  • Argmax decoding: mainly includes beam search, class-factored softmax, etc.
  • Stochastic decoding: mainly includes temperature sampling, top-k sampling, etc.

In a Seq2Seq model, the RNN encoder encodes the input sentence into a fixed-size hidden state $h_c$. Given the input sentence's hidden state $h_c$ and the previously generated words $x_{1:t-1}$, the RNN decoder produces the hidden state $h_t$ for the current ($t$-th) word, and a softmax function then yields the vocabulary probability distribution $P(x \mid x_{1:t-1})$ for the $t$-th word $x_t$.

The main difference between the two types of decoding strategies is how a word $x_t$ is chosen from the vocabulary probability distribution $P(x \mid x_{1:t-1})$:

  • Argmax decoding chooses the word with the highest probability in the vocabulary, i.e., $x_t = \arg\max P(x \mid x_{1:t-1})$;
  • Stochastic decoding randomly samples a word $x_t$ from the probability distribution $P(x \mid x_{1:t-1})$, i.e., $x_t \sim P(x \mid x_{1:t-1})$.

When doing sequence prediction, we need to sample a word at every time step from the model's softmax output probabilities, and an appropriate sampling method can produce noticeably better results.

1. Greedy sampling

1.1 Greedy Search

Core idea: at each step, take the most probable word as the result.

Method: compute the probability of each word in the vocabulary for the next position, take the argmax as the index of the word to generate, and then use that word to generate the following one.
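A minimal sketch of greedy decoding, assuming a hypothetical `model_step(tokens)` callable that returns the model's softmax distribution over the vocabulary for the next position:

```python
import numpy as np

def greedy_decode(model_step, start_token, eos_token, max_len=50):
    """Greedy search: at every step keep only the single most probable token.

    `model_step(tokens)` is a hypothetical callable returning the softmax
    probability distribution over the vocabulary for the next position.
    """
    tokens = [start_token]
    for _ in range(max_len):
        probs = model_step(tokens)           # shape: (vocab_size,)
        next_token = int(np.argmax(probs))   # index of the most probable word
        tokens.append(next_token)
        if next_token == eos_token:
            break
    return tokens
```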

1.2 Beam Search

Core idea: beam search restricts the search space on top of breadth-first search (similar to pruning) in order to reduce memory consumption.

Method: at every decoding step we keep the top K candidate words. At the next step, we decode one step further from each of these K words and take the top K continuations for each, then from the resulting K^2 candidate sentences we again keep the top K. This repeats until decoding ends. Beam search is still essentially a greedy decoding method, so it cannot guarantee finding the optimal decoding result.
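A simplified beam search sketch under the same assumption (a hypothetical `model_step(tokens)` returning next-token log-probabilities); a real implementation would also handle end-of-sequence tokens and length normalization:

```python
import numpy as np

def beam_search(model_step, start_token, beam_size=3, max_len=20):
    """Keep the top-K partial sequences at every decoding step."""
    beams = [([start_token], 0.0)]              # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = model_step(tokens)      # shape: (vocab_size,)
            top_ids = np.argsort(log_probs)[-beam_size:]   # top K continuations per beam
            for idx in top_ids:
                candidates.append((tokens + [int(idx)], score + float(log_probs[idx])))
        # From the K*K candidate sequences keep the K with the highest cumulative score.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0][0]                          # highest-scoring sequence
```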

Problems with greedy search and beam search:

  1. They tend to produce repetitive, predictable words;
  2. The generated sentences have poor coherence.

2. Random sampling

Core idea: sample words randomly according to their probability distribution.

2.1 Temperature Sampling:

Method: introduce a temperature into the softmax to reshape the vocabulary probability distribution, biasing it more strongly toward high-probability words:
$$P(x \mid x_{1:t-1}) = \frac{\exp(u_t / temperature)}{\sum_{t'} \exp(u_{t'} / temperature)}, \quad temperature \in [0, 1)$$
Another way to express this: suppose $p(x)$ is the original distribution output by the model. Given a temperature value, the original probability distribution (the model's softmax output) is reweighted as follows, producing a new probability distribution:
$$\pi(x_k) = \frac{e^{\log(p(x_k)) / temperature}}{\sum_{i=1}^{n} e^{\log(p(x_i)) / temperature}}, \quad temperature \in [0, 1)$$
As $temperature \to 0$, this degenerates into greedy search; as $temperature \to \infty$, it degenerates into uniform sampling. See the paper The Curious Case of Neural Text Degeneration for details.
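A minimal sketch of the reweighting above, where `probs` stands for the model's original softmax output (an assumed NumPy vector):

```python
import numpy as np

def temperature_sample(probs, temperature=0.5):
    """Reweight a softmax distribution with a temperature and sample from it.

    Lower temperatures sharpen the distribution (approaching greedy search);
    higher temperatures flatten it (approaching uniform sampling).
    """
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs) / temperature      # log(p(x_k)) / temperature
    exp = np.exp(logits - np.max(logits))     # subtract the max for numerical stability
    reweighted = exp / exp.sum()              # the new distribution pi(x_k)
    return int(np.random.choice(len(reweighted), p=reweighted))
```

For example, with `probs = [0.7, 0.2, 0.1]` and `temperature=0.5`, the reweighted distribution is roughly `[0.91, 0.07, 0.02]`, so the most probable word becomes even more likely to be sampled.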

2.2 Top-k Sampling:

Top-k sampling can alleviate the problem of generating rare words. For example, we can restrict sampling to the 50 most probable words: keep only the top-k words by probability, then sample among them according to their probabilities.

Core idea: sort the probabilities in descending order and set the probabilities after the k-th position to 0.

Method: during decoding, select from $P(x \mid x_{1:t-1})$ the $k$ tokens with the highest probability and sum their probabilities to obtain $p' = \sum P(x \mid x_{1:t-1})$; then rescale the distribution to $P'(x \mid x_{1:t-1}) = P(x \mid x_{1:t-1}) / p'$ for $x \in V^{(k)}$, and finally sample one token from $P'(x \mid x_{1:t-1})$ as the output token. See the paper Hierarchical Neural Story Generation for details.

The problem with top-k sampling is that the constant k is fixed in advance, whereas for sentences of different lengths and with different contexts we may sometimes need more tokens than k.
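A minimal top-k sampling sketch following the renormalization described above (`probs` again stands for the model's softmax output):

```python
import numpy as np

def top_k_sample(probs, k=50):
    """Keep only the k most probable tokens, renormalize, and sample one of them.

    Implements P'(x) = P(x) / p' for x in the top-k set V^(k), where p' is the
    total probability mass of those k tokens.
    """
    probs = np.asarray(probs, dtype=np.float64)
    top_ids = np.argsort(probs)[-k:]          # indices of the k largest probabilities
    top_probs = probs[top_ids]
    top_probs = top_probs / top_probs.sum()   # renormalize by p'
    return int(np.random.choice(top_ids, p=top_probs))
```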

2.3 Top-p Sampling (Nucleus Sampling):

Core idea: accumulate the probability distribution, and once the cumulative value exceeds the preset threshold p, set the probabilities of the remaining tokens to 0.

Method: top-p sampling was proposed to address this problem with top-k sampling. Building on top-k sampling, it fixes $p' = \sum P(x \mid x_{1:t-1})$ to a predefined constant $p' \in (0, 1)$, so that the set of selected tokens varies with the history distribution of the sentence. See the paper The Curious Case of Neural Text Degeneration for details.

Essentially, both top-p sampling and top-k sampling sample a token from a truncated vocabulary distribution; they differ only in how the truncation set is chosen.
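A minimal nucleus sampling sketch: sort the probabilities in descending order, keep the smallest prefix whose cumulative mass reaches p, renormalize, and sample (`probs` is the assumed softmax output):

```python
import numpy as np

def top_p_sample(probs, p=0.9):
    """Nucleus sampling: truncate the vocabulary at cumulative probability p."""
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(probs)[::-1]                     # token ids, most probable first
    sorted_probs = probs[order]
    cumulative = np.cumsum(sorted_probs)
    cutoff = int(np.searchsorted(cumulative, p)) + 1    # smallest prefix whose mass reaches p
    kept_ids = order[:cutoff]
    kept_probs = sorted_probs[:cutoff]
    kept_probs = kept_probs / kept_probs.sum()          # renormalize the truncated distribution
    return int(np.random.choice(kept_ids, p=kept_probs))
```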

Problems with random sampling:

  1. The generated sentences tend to be incoherent and contradict the context.
  2. It easily produces strange sentences containing rare words.

