Sampling and decoding strategies for seq2seq text generation
2022-06-25 08:29:00 【Happy little yard farmer】
Decoding and sampling strategies for seq2seq text generation
Text generation with Seq2Seq models can use different decoding strategies. Decoding strategies for text generation fall into two categories:
- Argmax decoding: mainly includes beam search, class-factored softmax, etc.
- Stochastic decoding: mainly includes temperature sampling, top-k sampling, etc.
In a Seq2Seq model, the RNN encoder encodes the input sentence into a fixed-size hidden state $h_c$. Conditioned on $h_c$ and the previously generated words $x_{1:t-1}$, the RNN decoder produces the hidden state $h_t$ for the current step $t$, and a softmax finally yields the vocabulary probability distribution $P(x \mid x_{1:t-1})$ of the $t$-th word $x_t$.
The main difference between the two types of decoding strategy is how the word $x_t$ is chosen from the vocabulary distribution $P(x \mid x_{1:t-1})$:
- Argmax decoding picks the word with the highest probability in the vocabulary, i.e. $x_t = \arg\max P(x \mid x_{1:t-1})$;
- Stochastic decoding randomly samples a word $x_t$ from the distribution $P(x \mid x_{1:t-1})$, i.e. $x_t \sim P(x \mid x_{1:t-1})$.
When doing sequence prediction, at every time step we need to sample a word from the probability distribution output by the model's softmax; a suitable sampling method can yield noticeably better results.
1. Greedy sampling
1.1 Greedy Search
Core idea: at each step, take the most probable word as the result.
Method: compute the probability of each vocabulary word for the next position, take the argmax as the index of the word to generate, and then move on to generating the next word.
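As a minimal sketch, greedy search can be written roughly as follows, assuming a hypothetical `decode_step(prev_token, state)` helper that returns the vocabulary probabilities for the next word together with the updated decoder state:

```python
import numpy as np

def greedy_decode(decode_step, state, bos_id, eos_id, max_len=50):
    """Greedy search: at every step take the argmax of the vocabulary distribution."""
    tokens = [bos_id]
    for _ in range(max_len):
        probs, state = decode_step(tokens[-1], state)  # P(x | x_{1:t-1}) over the vocabulary
        next_token = int(np.argmax(probs))             # the most probable word
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens
```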
1.2 Beam Search
Core idea: beam search prunes the breadth-first search space (a form of pruning) in order to reduce memory consumption.
Method: at every decoding step we keep the top K candidate words. At the next step, we decode one step further for each of these K candidates and take its top K continuations, then from the resulting K^2 candidate sentences keep the top K. This repeats until decoding finishes. Beam search is still essentially a greedy decoding method, so it cannot guarantee the optimal decoding result.
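Under the same assumption of a hypothetical `decode_step(prev_token, state)` helper, a rough beam search sketch (accumulating log-probabilities for numerical stability) might look like this:

```python
import numpy as np

def beam_search(decode_step, init_state, bos_id, eos_id, beam_size=3, max_len=50):
    """Keep the top-K partial sentences at every step; expand each and prune back to K."""
    # Each hypothesis: (accumulated log-probability, token list, decoder state, finished?)
    beams = [(0.0, [bos_id], init_state, False)]
    for _ in range(max_len):
        candidates = []
        for score, tokens, state, done in beams:
            if done:                                   # carry finished hypotheses over unchanged
                candidates.append((score, tokens, state, True))
                continue
            probs, new_state = decode_step(tokens[-1], state)
            # expand this hypothesis with its top-K continuations
            for tok in np.argsort(probs)[-beam_size:]:
                candidates.append((score + float(np.log(probs[tok] + 1e-12)),
                                   tokens + [int(tok)], new_state, int(tok) == eos_id))
        # out of the ~K*K candidates, keep the K with the highest scores
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(done for _, _, _, done in beams):
            break
    return beams[0][1]                                 # token ids of the best hypothesis
```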
Problems with greedy search and beam search:
- They are prone to repetitive, predictable words;
- The resulting sentences/language are poorly coherent.
2. Random sampling
Core idea: sample a word at random according to the word probability distribution.
2.1 Temperature Sampling:
Method: introduce a temperature into the softmax to reshape the vocabulary probability distribution so that it is more biased towards high-probability words:
$$P(x \mid x_{1:t-1}) = \frac{\exp(u_t / temperature)}{\sum_{t'} \exp(u_{t'} / temperature)}, \quad temperature \in [0, 1)$$
Expressed differently: let $p(x)$ be the original distribution output by the model (its softmax output). Given a temperature value, the original distribution is reweighted as follows, yielding a new probability distribution:
$$\pi(x_k) = \frac{e^{\log(p(x_k)) / temperature}}{\sum_{i=1}^{n} e^{\log(p(x_i)) / temperature}}, \quad temperature \in [0, 1)$$
As $temperature \to 0$ this reduces to greedy search; as $temperature \to \infty$ it approaches uniform sampling. See the paper The Curious Case of Neural Text Degeneration for details.
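A small numpy sketch of this reweighting, applied to an already-computed distribution `p` (the variable names here are illustrative, not taken from the cited paper):

```python
import numpy as np

def temperature_sample(p, temperature=0.5, rng=None):
    """Reweight a probability distribution p by a temperature and sample one index.

    Lower temperature -> sharper distribution (closer to greedy search);
    higher temperature -> flatter distribution (closer to uniform sampling).
    """
    rng = rng or np.random.default_rng()
    logits = np.log(p + 1e-12) / temperature    # log(p(x_k)) / temperature
    pi = np.exp(logits - logits.max())          # subtract the max for numerical stability
    pi = pi / pi.sum()                          # the reweighted distribution pi(x_k)
    return int(rng.choice(len(pi), p=pi))

# Toy example with a 4-word vocabulary
p = np.array([0.5, 0.3, 0.15, 0.05])
print(temperature_sample(p, temperature=0.2))   # almost always returns index 0
print(temperature_sample(p, temperature=1.0))   # samples from p itself
```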
2.2 Top-k Sampling:
Top-k sampling alleviates the problem of generating rare words. For example, we can restrict sampling to the 50 highest-probability words: keep only the top-k words by probability, then sample among them according to their probabilities.
Core idea: sort the probabilities in descending order and set the probabilities after the k-th position to 0.
Method: during decoding, select the k tokens with the highest probability under $P(x \mid x_{1:t-1})$, sum their probabilities to get $p' = \sum P(x \mid x_{1:t-1})$, rescale the distribution to $P'(x \mid x_{1:t-1}) = P(x \mid x_{1:t-1}) / p'$ for $x \in V^{(k)}$, and finally sample one token from $P'(x \mid x_{1:t-1})$ as the output token. See the paper Hierarchical Neural Story Generation for details.
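A rough numpy sketch of top-k sampling following the steps above, assuming `probs` holds the full vocabulary distribution $P(x \mid x_{1:t-1})$ as an array:

```python
import numpy as np

def top_k_sample(probs, k=50, rng=None):
    """Keep the k most probable tokens, renormalize their probabilities, and sample one."""
    rng = rng or np.random.default_rng()
    top_ids = np.argsort(probs)[-k:]           # indices of the top-k tokens, i.e. V^(k)
    p_prime = probs[top_ids].sum()             # p' = sum of their probabilities
    renormed = probs[top_ids] / p_prime        # P'(x) = P(x) / p'  for x in V^(k)
    return int(rng.choice(top_ids, p=renormed))
```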
The problem with top-k sampling is that the constant k is fixed in advance; for sentences of different lengths and contexts, we may sometimes need more than k candidate tokens.
2.3 Top-p Sampling (Nucleus Sampling):
Core idea: accumulate the sorted probability distribution; once the cumulative value exceeds the threshold p, set the probabilities of the remaining words to 0.
Method: top-p sampling was proposed to address the problem of top-k sampling. Building on top-k sampling, it fixes $p' = \sum P(x \mid x_{1:t-1})$ to a pre-defined constant $p' \in (0, 1)$, so that the set of selected tokens varies with the history distribution of the sentence. See the paper The Curious Case of Neural Text Degeneration for details.
Essentially, both top-p sampling and top-k sampling sample a token from a truncated vocabulary distribution; the difference lies in how the truncation set is chosen.
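A corresponding rough sketch of top-p (nucleus) sampling, again assuming `probs` holds the full vocabulary distribution:

```python
import numpy as np

def top_p_sample(probs, p=0.9, rng=None):
    """Nucleus sampling: keep the smallest set of tokens whose cumulative mass reaches p."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                      # token ids, highest probability first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1     # number of tokens kept
    nucleus = order[:cutoff]
    renormed = probs[nucleus] / probs[nucleus].sum()     # renormalize inside the nucleus
    return int(rng.choice(nucleus, p=renormed))
```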
Problems with random sampling:
- The generated sentences tend to be incoherent and to contradict the context.
- It is easy to produce strange sentences containing rare words.