
Sampling and decoding strategies for seq2seq text generation

2022-06-25 08:29:00 Happy little yard farmer

What are the decoding and sampling strategies for seq2seq-based text generation?

Seq2Seq-based text generation models can use different decoding strategies. Decoding strategies in text generation fall into two categories:

  • Argmax decoding: mainly includes beam search, class-factored softmax, etc.
  • Stochastic decoding: mainly includes temperature sampling, top-k sampling, etc.

In a Seq2Seq model, the RNN encoder encodes the input sentence into a fixed-size hidden state $h_c$. Given the input sentence's hidden state $h_c$ and the previously generated words $x_{1:t-1}$, the RNN decoder produces the hidden state $h_t$ for the current ($t$-th) word, and a softmax function then yields the vocabulary probability distribution $P(x \mid x_{1:t-1})$ for the $t$-th word $x_t$.

The main difference between the two types of decoding strategies is how a word $x_t$ is chosen from the vocabulary probability distribution $P(x \mid x_{1:t-1})$:

  • Argmax decoding chooses the word with the highest probability in the vocabulary, i.e., $x_t = \arg\max P(x \mid x_{1:t-1})$;
  • Stochastic decoding randomly samples a word $x_t$ from the probability distribution $P(x \mid x_{1:t-1})$, i.e., $x_t \sim P(x \mid x_{1:t-1})$.

When doing sequence prediction, we need to sample a word at every time step from the model's softmax output probabilities, and an appropriate sampling method can produce noticeably better results.

1. Greedy sampling

1.1 Greedy Search

Core idea: at each step, take the most probable word as the result.

Method: compute the probability of each word in the vocabulary for the next position, take the argmax as the index of the word to generate, and then use that word to generate the following one.
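A minimal sketch of greedy decoding, assuming a hypothetical `model_step(tokens)` callable that returns the model's softmax distribution over the vocabulary for the next position:

```python
import numpy as np

def greedy_decode(model_step, start_token, eos_token, max_len=50):
    """Greedy search: at every step keep only the single most probable token.

    `model_step(tokens)` is a hypothetical callable returning the softmax
    probability distribution over the vocabulary for the next position.
    """
    tokens = [start_token]
    for _ in range(max_len):
        probs = model_step(tokens)           # shape: (vocab_size,)
        next_token = int(np.argmax(probs))   # index of the most probable word
        tokens.append(next_token)
        if next_token == eos_token:
            break
    return tokens
```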

1.2 Beam Search

Core idea: beam search restricts the search space on top of breadth-first search (similar to pruning) in order to reduce memory consumption.

Method: at every decoding step we keep the top K candidate words. At the next step, we decode one step further from each of these K words and take the top K continuations for each, then from the resulting K^2 candidate sentences we again keep the top K. This repeats until decoding ends. Beam search is still essentially a greedy decoding method, so it cannot guarantee finding the optimal decoding result.
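A simplified beam search sketch under the same assumption (a hypothetical `model_step(tokens)` returning next-token log-probabilities); a real implementation would also handle end-of-sequence tokens and length normalization:

```python
import numpy as np

def beam_search(model_step, start_token, beam_size=3, max_len=20):
    """Keep the top-K partial sequences at every decoding step."""
    beams = [([start_token], 0.0)]              # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = model_step(tokens)      # shape: (vocab_size,)
            top_ids = np.argsort(log_probs)[-beam_size:]   # top K continuations per beam
            for idx in top_ids:
                candidates.append((tokens + [int(idx)], score + float(log_probs[idx])))
        # From the K*K candidate sequences keep the K with the highest cumulative score.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0][0]                          # highest-scoring sequence
```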

Problems with greedy search and beam search:

  1. They tend to produce repetitive, predictable words;
  2. The generated sentences have poor coherence.

2. Random sampling

Core idea: sample words randomly according to their probability distribution.

2.1 Temperature Sampling:

Method: introduce a temperature into the softmax to reshape the vocabulary probability distribution, biasing it more strongly toward high-probability words:
$$P(x \mid x_{1:t-1}) = \frac{\exp(u_t / temperature)}{\sum_{t'} \exp(u_{t'} / temperature)}, \quad temperature \in [0, 1)$$
Another way to express this: suppose $p(x)$ is the original distribution output by the model. Given a temperature value, the original probability distribution (the model's softmax output) is reweighted as follows, producing a new probability distribution:
$$\pi(x_k) = \frac{e^{\log(p(x_k)) / temperature}}{\sum_{i=1}^{n} e^{\log(p(x_i)) / temperature}}, \quad temperature \in [0, 1)$$
As $temperature \to 0$, this degenerates into greedy search; as $temperature \to \infty$, it degenerates into uniform sampling. See the paper The Curious Case of Neural Text Degeneration for details.
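A minimal sketch of the reweighting above, where `probs` stands for the model's original softmax output (an assumed NumPy vector):

```python
import numpy as np

def temperature_sample(probs, temperature=0.5):
    """Reweight a softmax distribution with a temperature and sample from it.

    Lower temperatures sharpen the distribution (approaching greedy search);
    higher temperatures flatten it (approaching uniform sampling).
    """
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs) / temperature      # log(p(x_k)) / temperature
    exp = np.exp(logits - np.max(logits))     # subtract the max for numerical stability
    reweighted = exp / exp.sum()              # the new distribution pi(x_k)
    return int(np.random.choice(len(reweighted), p=reweighted))
```

For example, with `probs = [0.7, 0.2, 0.1]` and `temperature=0.5`, the reweighted distribution is roughly `[0.91, 0.07, 0.02]`, so the most probable word becomes even more likely to be sampled.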

2.2 Top-k Sampling:

Top-k sampling can alleviate the problem of generating rare words. For example, we can restrict sampling to the 50 most probable words: keep only the top-k words by probability, then sample among them according to their probabilities.

Core idea: sort the probabilities in descending order and set the probabilities after the k-th position to 0.

Method: during decoding, select from $P(x \mid x_{1:t-1})$ the $k$ tokens with the highest probability and sum their probabilities to obtain $p' = \sum P(x \mid x_{1:t-1})$; then rescale the distribution to $P'(x \mid x_{1:t-1}) = P(x \mid x_{1:t-1}) / p'$ for $x \in V^{(k)}$, and finally sample one token from $P'(x \mid x_{1:t-1})$ as the output token. See the paper Hierarchical Neural Story Generation for details.

The problem with top-k sampling is that the constant k is fixed in advance, whereas for sentences of different lengths and with different contexts we may sometimes need more tokens than k.
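A minimal top-k sampling sketch following the renormalization described above (`probs` again stands for the model's softmax output):

```python
import numpy as np

def top_k_sample(probs, k=50):
    """Keep only the k most probable tokens, renormalize, and sample one of them.

    Implements P'(x) = P(x) / p' for x in the top-k set V^(k), where p' is the
    total probability mass of those k tokens.
    """
    probs = np.asarray(probs, dtype=np.float64)
    top_ids = np.argsort(probs)[-k:]          # indices of the k largest probabilities
    top_probs = probs[top_ids]
    top_probs = top_probs / top_probs.sum()   # renormalize by p'
    return int(np.random.choice(top_ids, p=top_probs))
```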

2.3 Top-p Sampling (Nucleus Sampling):

Core idea: accumulate the probability distribution, and once the cumulative value exceeds the preset threshold p, set the probabilities of the remaining tokens to 0.

Method: top-p sampling was proposed to address this problem with top-k sampling. Building on top-k sampling, it fixes $p' = \sum P(x \mid x_{1:t-1})$ to a predefined constant $p' \in (0, 1)$, so that the set of selected tokens varies with the history distribution of the sentence. See the paper The Curious Case of Neural Text Degeneration for details.

Essentially, both top-p sampling and top-k sampling sample a token from a truncated vocabulary distribution; they differ only in how the truncation set is chosen.
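A minimal nucleus sampling sketch: sort the probabilities in descending order, keep the smallest prefix whose cumulative mass reaches p, renormalize, and sample (`probs` is the assumed softmax output):

```python
import numpy as np

def top_p_sample(probs, p=0.9):
    """Nucleus sampling: truncate the vocabulary at cumulative probability p."""
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(probs)[::-1]                     # token ids, most probable first
    sorted_probs = probs[order]
    cumulative = np.cumsum(sorted_probs)
    cutoff = int(np.searchsorted(cumulative, p)) + 1    # smallest prefix whose mass reaches p
    kept_ids = order[:cutoff]
    kept_probs = sorted_probs[:cutoff]
    kept_probs = kept_probs / kept_probs.sum()          # renormalize the truncated distribution
    return int(np.random.choice(kept_ids, p=kept_probs))
```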

Problems with random sampling:

  1. The generated sentences tend to be incoherent and contradict the context.
  2. It easily produces strange sentences containing rare words.

