
Making a Chatbot Based on GPT-2

2022-06-24 04:02:00 Goose

1. Background

Everyone has probably experienced this: for one reason or another, a good friend no longer chats with you. So can we use their WeChat chat history to roughly reproduce that person's chatting habits, tone, and even the stickers they liked to send?

This post is based on GPT2-Chinese and describes how to train a chatbot on a friend's chat history. The final result still depends on whether the training corpus is sufficient, as well as on model selection, parameter tuning, and so on. Getting it to run is not hard; tuning it to imitate someone well is. If you are interested, try other modeling approaches or corpus choices.

The second half of the article will briefly cover the principles and tuning of GPT-2.

Without further ado, let's start with what is probably the most complicated part of this article: once the development and runtime environment is ready, the demo is almost half done.

2. GPT-2 Principle introduction

You can review the Transformer, BERT, and GPT models in earlier blog posts.

GPT-2 is built from transformer decoder blocks, while BERT is built from transformer encoder blocks. A key difference worth pointing out: like a traditional language model, GPT-2 outputs only one word (token) at a time.
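This one-token-at-a-time, decoder-only behavior can be illustrated with a toy sketch. The lookup table below is a made-up stand-in for the model; in real GPT-2, each step is a full forward pass through the stacked decoder blocks, scoring every token in the vocabulary:

```python
# Toy autoregressive decoding loop: like GPT-2, it emits exactly one token per
# step, feeding everything generated so far back in as context.
# The "model" here is just a next-token lookup table (an illustrative stand-in).

TOY_MODEL = {
    "<bos>": "how",
    "how": "are",
    "are": "you",
    "you": "<eos>",
}

def generate(model, max_len=10):
    tokens = ["<bos>"]
    for _ in range(max_len):
        # In GPT-2 this step is a forward pass over all of `tokens`;
        # only ONE new token comes out per iteration.
        next_token = model[tokens[-1]]
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the <bos> marker

print(generate(TOY_MODEL))  # ['how', 'are', 'you']
```

BERT, by contrast, sees the whole sequence at once and cannot be used as a generator in this loop.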

3. Environmental preparation

Reference runtime environment:

CentOS 7, Python 3.6

Run the following commands:

yum -y install python36-devel
git clone
cd gpt2-chatbot
pip3 install -r requirements.txt

4.1 Corpus preparation and preprocessing

In the project root directory there is a data folder; place the raw training corpus there, named train.txt. In train.txt each utterance goes on its own line, and a blank line separates one dialogue from the next, as follows:

Training corpora can be downloaded from:

https://github.com/codemayq/chinese_chatbot_corpus

 I really want to go to the movies with you 
 Suddenly miss you very much 
 I miss you, too 

 I want to see your beautiful photos 
 Kiss me and I'll show you 
 I kiss two 
 I hate people beating you on the chest with small fists 

Process all of the colleague's chat records and append them to the training file (whether the samples should be weighted differently is an open question).
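On that sample-weighting question: one crude option, assumed here rather than taken from the project, is simply to duplicate the friend's dialogues a few times when appending, so they carry more weight than the downloaded corpus (`append_weighted` is a hypothetical helper, not part of the project):

```python
# Sketch: append a friend's dialogues to train.txt, duplicated `weight` times,
# as a crude form of sample weighting (hypothetical helper, not in the project).

def append_weighted(train_path, dialogues, weight=3):
    # `dialogues` is a list of dialogues; each dialogue is a list of utterances.
    with open(train_path, "a", encoding="utf-8") as f:
        for _ in range(weight):
            for dialog in dialogues:
                # one utterance per line, blank line between dialogues
                f.write("\n".join(dialog) + "\n\n")
```

Duplication changes the effective sampling distribution without touching the training code, at the cost of a larger file.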

Run preprocess.py to tokenize the dialogue corpus in data/train.txt, then serialize it and save it to data/train.pkl. The object serialized in train.pkl is a List, with one entry per dialogue, and each dialogue holds its tokens.

python3 preprocess.py --train_path data/train.txt --save_path data/train.pkl
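The project's own preprocess.py is authoritative, but its core logic is roughly the following sketch. The character-level ids and the [CLS]/[SEP] ids below are illustrative stand-ins, not the tokenizer the project actually uses:

```python
import pickle

# Simplified sketch of the preprocessing step: split train.txt into dialogues
# on blank lines, tokenize each dialogue, and pickle the list of token lists.
# Character codes and the special-token ids are illustrative placeholders.

CLS, SEP = 101, 102  # placeholder special-token ids

def tokenize_dialog(lines):
    ids = [CLS]
    for line in lines:
        ids.extend(ord(ch) for ch in line)  # naive char-level "tokenizer"
        ids.append(SEP)                     # separator after each utterance
    return ids

def preprocess(train_path, save_path):
    raw = open(train_path, encoding="utf-8").read()
    # a blank line separates dialogues
    dialogs = [d.splitlines() for d in raw.split("\n\n") if d.strip()]
    data = [tokenize_dialog(d) for d in dialogs]
    with open(save_path, "wb") as f:
        pickle.dump(data, f)  # train.pkl: a List of per-dialogue token lists
    return data
```

The pickled structure matches the description above: a List whose entries are dialogues, each dialogue being a flat list of token ids.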

4.2 Training the model

Run train.py to train the model autoregressively on the preprocessed data. The model is saved in the model folder in the root directory.

During training you can specify the patience parameter for early stopping. With patience=n, if the model's loss on the validation set does not decrease for n consecutive epochs, training stops early. With patience=0, early stopping is disabled.

python3 train.py --epochs 40 --batch_size 8 --device 0,1 --train_path data/train.pkl
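The patience logic described above is a generic early-stopping pattern and can be sketched as a small helper (the project's own implementation may differ in detail):

```python
# Generic patience-based early stopping, as described above:
# stop when the validation loss has not decreased for `patience` epochs.
# patience == 0 disables early stopping entirely.

class EarlyStopper:
    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if self.patience == 0:      # patience=0: never stop early
            return False
        if val_loss < self.best:
            self.best = val_loss    # new best: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1    # no improvement this epoch
        return self.bad_epochs >= self.patience
```

Each epoch, the training loop would call should_stop(val_loss) after validation and break when it returns True.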

5. Other

There are actually many parts that could still be improved; I have not yet fully understood the project code. A few ideas, jotted down to play with when there is time:

  • Online learning
  • Try other pre training models
  • Have the bot comment on the day's trending Weibo topics

Presumably the group chat would be a little less quiet once these are in place.

References

  1. https://zhuanlan.zhihu.com/p/96755231
  2. https://github.com/yangjianxin1/GPT2-chitchat
  3. https://github.com/sfyc23/EverydayWechat
  4. https://github.com/Morizeyao/GPT2-Chinese
  5. https://zhuanlan.zhihu.com/p/57251615
  6. Paper: https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf
  7. https://github.com/openai/gpt-2
  8. Machine Heart's explainer of the model: https://www.sohu.com/a/336262203_129720

Copyright notice
This article was written by [Goose]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2021/09/20210914225613278y.html