Sampling and decoding strategies for Seq2Seq-based text generation
2022-06-25 08:29:00 【Happy little yard farmer】
What decoding and sampling strategies are used for Seq2Seq-based text generation?
Text generation with Seq2Seq models can use different decoding strategies. Decoding strategies for text generation fall into two categories:
- Argmax decoding: mainly includes beam search, class-factored softmax, etc.
- Stochastic decoding: mainly includes temperature sampling, top-k sampling, etc.
In a Seq2Seq model, the RNN encoder encodes the input sentence into a fixed-size hidden state $h_c$. Conditioned on $h_c$ and the previously generated words $x_{1:t-1}$, the RNN decoder produces the hidden state $h_t$ for the current ($t$-th) word, and a softmax then yields the vocabulary probability distribution of the $t$-th word $x_t$, namely $P(x \mid x_{1:t-1})$.
The main difference between the two types of decoding strategies is how the word $x_t$ is chosen from the vocabulary distribution $P(x \mid x_{1:t-1})$:
- Argmax decoding picks the word with the highest probability in the vocabulary, i.e. $x_t = \arg\max P(x \mid x_{1:t-1})$;
- Stochastic decoding randomly samples a word $x_t$ from the distribution $P(x \mid x_{1:t-1})$, i.e. $x_t \sim P(x \mid x_{1:t-1})$.
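As a minimal illustration (a NumPy sketch, not tied to any particular framework), assume the decoder has already produced a vocabulary distribution `p` for one step; the two families of strategies differ only in this final selection step. The 5-word distribution below is a made-up example:

```python
import numpy as np

# Hypothetical vocabulary distribution P(x | x_{1:t-1}) for one decoding step,
# e.g. the decoder's softmax output over a tiny 5-word vocabulary.
p = np.array([0.5, 0.2, 0.15, 0.1, 0.05])

# Argmax decoding: deterministically pick the most probable token.
x_t_argmax = int(np.argmax(p))

# Stochastic decoding: draw a token index according to the distribution.
rng = np.random.default_rng(0)
x_t_sampled = int(rng.choice(len(p), p=p))

print(x_t_argmax, x_t_sampled)
```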
When doing sequence prediction, the model has to pick a word from the softmax output distribution at every step, so a suitable sampling method can lead to noticeably better results.
1. Greedy sampling
1.1 Greedy Search
Core idea: at every step, take the most probable word as the result.
Method: given the probability of each word in the vocabulary for the next position, take the argmax as the index of the word to generate, then continue generating the following word.
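A minimal greedy-decoding loop might look like the sketch below; `decoder_step` is a hypothetical placeholder for the Seq2Seq decoder's forward pass, assumed to return the next-token distribution given the tokens generated so far:

```python
import numpy as np

def greedy_decode(decoder_step, bos_id, eos_id, max_len=50):
    """Greedy search: at every step keep only the single most probable token.

    `decoder_step(prefix)` is assumed to return a 1-D array holding the
    vocabulary distribution P(x | prefix).
    """
    tokens = [bos_id]
    for _ in range(max_len):
        probs = decoder_step(tokens)      # P(x | x_{1:t-1})
        next_id = int(np.argmax(probs))   # take the argmax as the next token
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```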
1.2 Beam Search
Core idea: beam search prunes a breadth-first search of the output space, thereby reducing memory consumption.
Method: at every decoding step, keep the top K candidate sequences. At the next step, decode one step for each of the K candidates, take the top K continuations of each, and from the resulting K^2 candidate sentences keep the top K. Repeat until decoding ends. Beam search is still essentially a greedy decoding method, so it cannot guarantee the optimal decoding result.
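A simplified beam-search sketch, again assuming a hypothetical `decoder_step(prefix)` that returns the next-token distribution; scores are accumulated as log-probabilities, the usual numerically stable choice:

```python
import numpy as np

def beam_search(decoder_step, bos_id, eos_id, beam_size=3, max_len=50):
    """Keep the top-K partial sentences at every step (K = beam_size)."""
    beams = [([bos_id], 0.0)]                 # (token_list, accumulated log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_id:          # finished hypotheses are kept as-is
                candidates.append((tokens, score))
                continue
            probs = decoder_step(tokens)      # P(x | x_{1:t-1})
            # Expand this beam with its top-K continuations.
            for idx in np.argsort(probs)[-beam_size:]:
                candidates.append((tokens + [int(idx)],
                                   score + float(np.log(probs[idx] + 1e-12))))
        # From the (at most K*K) candidates, keep the K best sentences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
        if all(t[-1] == eos_id for t, _ in beams):
            break
    return beams[0][0]                        # best-scoring sentence
```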
Problems with greedy search and beam search:
- they tend to produce repetitive, predictable words;
- the generated sentences/text have poor coherence.
2. Random sampling
Core idea: sample words randomly according to their probability distribution.
2.1 Temperature Sampling:
Method: introduce a temperature into the softmax to reshape the vocabulary probability distribution, making it more biased towards high-probability words:
$$P(x \mid x_{1:t-1}) = \frac{\exp(u_t / temperature)}{\sum_{t'} \exp(u_{t'} / temperature)}, \qquad temperature \in [0, 1)$$
Another way to express this: let $p(x)$ be the original distribution output by the model (its softmax output). Given a temperature value, the original probability distribution is reweighted as follows, yielding a new probability distribution:
$$\pi(x_k) = \frac{e^{\log(p(x_k)) / temperature}}{\sum_{i=1}^{n} e^{\log(p(x_i)) / temperature}}, \qquad temperature \in [0, 1)$$
As $temperature \to 0$ this degenerates into greedy search; as $temperature \to \infty$ it becomes uniform sampling. See the paper The Curious Case of Neural Text Degeneration for details.
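The reweighting is straightforward to implement; the sketch below follows the second formulation, operating on the softmax output $p$ rather than on the logits:

```python
import numpy as np

def temperature_rescale(p, temperature):
    """Reweight a probability distribution p with the given temperature.

    Smaller temperatures sharpen the distribution (towards greedy search);
    values close to 1 leave it almost unchanged.
    """
    p = np.asarray(p, dtype=np.float64)
    logits = np.log(p + 1e-12) / temperature
    logits -= logits.max()                  # for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def temperature_sample(p, temperature, rng=np.random.default_rng()):
    pi = temperature_rescale(p, temperature)
    return int(rng.choice(len(pi), p=pi))
```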
2.2 Top-k Sampling:
Top-k sampling alleviates the problem of generating rare words. For example, we can sample only among the 50 highest-probability words: keep only the top-k words by probability, then sample among them according to their probabilities.
Core idea: sort the probabilities in descending order, then set the probabilities after the k-th position to 0.
Method: during decoding, select the k tokens with the highest probability under $P(x \mid x_{1:t-1})$, sum their probabilities to obtain $p' = \sum_{x \in V^{(k)}} P(x \mid x_{1:t-1})$, then renormalize to $P'(x \mid x_{1:t-1}) = P(x \mid x_{1:t-1}) / p'$ for $x \in V^{(k)}$, and finally sample one token from $P'(x \mid x_{1:t-1})$ as the output token. See the paper Hierarchical Neural Story Generation for details.
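One top-k sampling step could be sketched as follows (NumPy, with k a fixed constant chosen in advance):

```python
import numpy as np

def top_k_sample(p, k, rng=np.random.default_rng()):
    """Keep the k most probable tokens, renormalize, and sample among them."""
    p = np.asarray(p, dtype=np.float64)
    top_ids = np.argsort(p)[-k:]             # indices of the k largest probabilities
    p_prime = p[top_ids] / p[top_ids].sum()  # P'(x) = P(x) / p'
    return int(rng.choice(top_ids, p=p_prime))
```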
The problem with top-k sampling is that the constant k is fixed in advance, while sentences of different lengths and with different contexts may sometimes call for more candidate tokens.
2.3 Top-p Sampling (Nucleus Sampling):
Core idea: accumulate the probability distribution (sorted in descending order); once the cumulative value exceeds a preset threshold p, set the remaining probabilities to 0.
Method: top-p sampling was proposed to solve the problem of top-k sampling. Building on top-k sampling, it fixes $p' = \sum P(x \mid x_{1:t-1})$ to a pre-defined constant $p' \in (0, 1)$, so the set of selected tokens varies with the history-conditioned distribution. See the paper The Curious Case of Neural Text Degeneration for details.
Essentially, both top-p sampling and top-k sampling draw a token from a truncated vocabulary distribution; they differ only in how the truncation threshold is chosen.
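A corresponding top-p (nucleus) sampling sketch; here the size of the kept set adapts to the shape of the distribution instead of being fixed (the threshold value 0.9 is an arbitrary example):

```python
import numpy as np

def top_p_sample(p, p_threshold=0.9, rng=np.random.default_rng()):
    """Keep the smallest set of tokens whose cumulative probability exceeds
    p_threshold, renormalize, and sample among them."""
    p = np.asarray(p, dtype=np.float64)
    order = np.argsort(p)[::-1]               # token ids sorted by descending probability
    cumulative = np.cumsum(p[order])
    cutoff = int(np.searchsorted(cumulative, p_threshold)) + 1
    kept = order[:cutoff]                     # the "nucleus"
    p_prime = p[kept] / p[kept].sum()
    return int(rng.choice(kept, p=p_prime))
```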
Problems with random sampling:
- the generated sentences tend to be incoherent and self-contradictory;
- it easily produces strange sentences containing rare words.