[CANN training camp] Learning notes - Comparing Diffusion and GAN: DALLE2 and Parti

2022-07-23 09:08:00 Hua Weiyun

I attended a live class on GAN and read the related articles, and I want to use this note to summarize them. This note is also my personal reflection on the third question of the training camp's advanced class, which asks how to assess the development potential of GAN and Diffusion. I think starting from the current SOTA models is the most intuitive way to get a feel for their capabilities, hence this article. Besides DALLE2 and Parti, I also hope to sort out the earlier work they build on. Since I had not studied the field of image generation thoroughly before and time was short, the content may be flawed.

DALLE2

dalle2.png

As shown in the figure above, DALLE2 is trained in two stages. The upper half above the dotted line uses CLIP contrastive learning to obtain a text encoder and an image encoder, which encode text and images into vectors such that the image embedding and text embedding of a matching pair are as similar as possible. The lower half is used for image generation and consists of a prior and a decoder. The decoder's role is to invert the embedding produced by the image encoder back into the original image, while the prior maps the caption text (or its text embedding) into the space of image embeddings. The decoder is a diffusion model similar to GLIDE, except that a projection of the CLIP image embedding is additionally added to the original conditioning input. The paper proposes two prior structures, an autoregressive model and a diffusion model. Under human evaluation, the paper compares the two priors (alongside GLIDE) and finds that the diffusion prior is slightly better than the autoregressive prior in photorealism, caption consistency, and diversity.

dalle2_2.png


The quantitative FID metric also shows the advantage of the diffusion prior.

dalle2_3.png
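To make the two-stage structure described above concrete, here is a minimal sketch of DALLE2-style inference in PyTorch. All module names, dimensions, and the collapsed denoising loop are my own illustrative assumptions, not OpenAI's implementation: a CLIP text embedding is mapped by the prior to an image embedding, and a diffusion-style decoder then iteratively refines random noise conditioned on that embedding.

```python
# Toy sketch of a DALLE2-style two-stage pipeline (illustrative only,
# not the actual OpenAI implementation).
import torch
import torch.nn as nn

EMB = 512  # assumed shared CLIP embedding size

class ToyPrior(nn.Module):
    """Maps a CLIP text embedding to a CLIP image embedding
    (in the paper this is an autoregressive or a diffusion model)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(EMB, EMB), nn.GELU(), nn.Linear(EMB, EMB))
    def forward(self, text_emb):
        return self.net(text_emb)

class ToyDecoder(nn.Module):
    """Diffusion-style decoder: refines an image conditioned on the image embedding.
    Collapsed into a single conditional network here for illustration."""
    def __init__(self, img_dim=3 * 64 * 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(img_dim + EMB, 1024), nn.GELU(),
                                 nn.Linear(1024, img_dim))
    def forward(self, noisy_img, img_emb):
        return self.net(torch.cat([noisy_img, img_emb], dim=-1))

def generate(text_emb, prior, decoder, steps=10):
    """Inference: text embedding -> prior -> image embedding -> iterative decoding."""
    img_emb = prior(text_emb)
    x = torch.randn(text_emb.shape[0], 3 * 64 * 64)  # start from pure noise
    for _ in range(steps):                           # crude stand-in for the reverse diffusion loop
        x = decoder(x, img_emb)
    return x.view(-1, 3, 64, 64)

if __name__ == "__main__":
    text_emb = torch.randn(2, EMB)  # stand-in for CLIP text-encoder output
    imgs = generate(text_emb, ToyPrior(), ToyDecoder())
    print(imgs.shape)  # torch.Size([2, 3, 64, 64])
```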

Parti

Parti is built on Google's newly proposed Pathways architecture to achieve efficient network training; its largest version has 20 billion parameters.

parti.png

As shown in the figure above, the model encodes the text with a Transformer encoder, and the Transformer decoder in the middle treats text-to-image generation as a Seq2Seq task. The images are encoded into discrete tokens by a ViT (Vision Transformer)-based tokenizer, shown in the figure below.

vit.png
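The following toy sketch illustrates the Seq2Seq formulation described above: a Transformer encoder reads text tokens and a decoder autoregressively predicts discrete image tokens, which a ViT-VQGAN decoder (omitted here) would map back to pixels. The vocabulary sizes, layer counts, and greedy decoding loop are illustrative assumptions, not the actual Parti configuration.

```python
# Toy sketch of the Parti-style "text-to-image as Seq2Seq" idea (illustrative only).
import torch
import torch.nn as nn

TEXT_VOCAB, IMG_VOCAB, D = 1000, 8192, 256  # assumed text vocab, image codebook, model width

class ToyParti(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(TEXT_VOCAB, D)
        self.img_emb = nn.Embedding(IMG_VOCAB, D)
        self.transformer = nn.Transformer(d_model=D, nhead=4,
                                          num_encoder_layers=2, num_decoder_layers=2,
                                          batch_first=True)
        self.to_logits = nn.Linear(D, IMG_VOCAB)

    def forward(self, text_tokens, img_tokens):
        # Encoder reads the caption; decoder predicts the next image token (teacher forcing).
        mask = self.transformer.generate_square_subsequent_mask(img_tokens.size(1))
        h = self.transformer(self.text_emb(text_tokens), self.img_emb(img_tokens),
                             tgt_mask=mask)
        return self.to_logits(h)

@torch.no_grad()
def sample_image_tokens(model, text_tokens, length=16, bos=0):
    """Greedy autoregressive decoding of discrete image tokens;
    a ViT-VQGAN decoder (not shown) would turn them back into pixels."""
    img = torch.full((text_tokens.size(0), 1), bos, dtype=torch.long)
    for _ in range(length):
        logits = model(text_tokens, img)
        img = torch.cat([img, logits[:, -1:].argmax(-1)], dim=1)
    return img[:, 1:]

if __name__ == "__main__":
    model = ToyParti()
    text = torch.randint(0, TEXT_VOCAB, (2, 12))  # stand-in for tokenized captions
    print(sample_image_tokens(model, text).shape)  # torch.Size([2, 16])
```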

Comparison of GAN and Diffusion

Because GAN has to train a generator and a discriminator at the same time, it is hard to keep the two in balance, which makes training unstable. By comparison, Diffusion only trains one model, so optimization is easier; however, its reverse process p must be carried out step by step, which hurts inference efficiency. Parti uses VQGAN and achieves better results than Diffusion, but note that Parti has far more parameters than previous models, and the pretrained text encoder also has a significant impact on the final result, so it is hard to say whether the improvement in overall performance comes from GAN. At the end of the Parti paper, the authors also mention that combining Diffusion and autoregression could be explored further. In the field of image generation, I personally feel Diffusion is still dominant, but GAN's application fields are more flexible and broader, and in that respect it cannot be replaced by Diffusion.
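The contrast above can be made concrete with two toy training steps: a GAN step has to update two adversarial networks in balance, while a diffusion step optimizes a single denoising objective but then needs a step-by-step sampling loop at inference time. The networks, noising schedule, and sampler below are deliberately simplified assumptions for illustration, not any specific paper's recipe.

```python
# Illustrative contrast between GAN training (two coupled updates) and
# diffusion training (one objective, but many-step sampling).
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 64
gen  = nn.Sequential(nn.Linear(16, DIM), nn.ReLU(), nn.Linear(DIM, DIM))          # toy generator
disc = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(), nn.Linear(DIM, 1))           # toy discriminator
denoiser = nn.Sequential(nn.Linear(DIM + 1, DIM), nn.ReLU(), nn.Linear(DIM, DIM)) # toy noise predictor
opt_g, opt_d = torch.optim.Adam(gen.parameters()), torch.optim.Adam(disc.parameters())
opt_den = torch.optim.Adam(denoiser.parameters())

def gan_step(real):
    # Two coupled updates: keeping them balanced is what makes GAN training unstable.
    fake = gen(torch.randn(real.size(0), 16))
    d_loss = F.binary_cross_entropy_with_logits(disc(real), torch.ones(real.size(0), 1)) + \
             F.binary_cross_entropy_with_logits(disc(fake.detach()), torch.zeros(real.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    g_loss = F.binary_cross_entropy_with_logits(disc(fake), torch.ones(real.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

def diffusion_step(real, T=100):
    # A single regression objective: predict the noise added at a random timestep.
    t = torch.randint(1, T + 1, (real.size(0), 1)).float() / T
    noise = torch.randn_like(real)
    noisy = (1 - t) * real + t * noise  # crude linear noising, for illustration only
    loss = F.mse_loss(denoiser(torch.cat([noisy, t], dim=-1)), noise)
    opt_den.zero_grad(); loss.backward(); opt_den.step()

@torch.no_grad()
def diffusion_sample(n=4, T=100):
    # Sampling must traverse the reverse process p step by step, hence slow inference.
    x = torch.randn(n, DIM)
    for step in reversed(range(1, T + 1)):
        t = torch.full((n, 1), step / T)
        x = x - denoiser(torch.cat([x, t], dim=-1)) / T
    return x

if __name__ == "__main__":
    real = torch.randn(8, DIM)
    gan_step(real)
    diffusion_step(real)
    print(diffusion_sample().shape)  # torch.Size([4, 64])
```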


Copyright notice
This article was created by [Hua Weiyun]. Please include a link to the original when reprinting. Thanks.
https://yzsam.com/2022/204/202207230040282416.html