[MAE] Masked Autoencoders (masked autoencoder)
2022-06-23 17:34:00 【luemeon】
Contents
Asymmetric encoder-decoder architecture
Randomly mask patches of the input image and reconstruct the missing pixels
The MAE pre-trained model generalizes well
Asymmetric encoder-decoder architecture
The encoder's input is only the unmasked (visible) patches;
the decoder is lightweight
(the decoder is used only during pre-training for image reconstruction, so its design is independent of the encoder and can be kept flexible and lightweight).
The decoder's input is the output of the encoder together with the position information of the masked patches,
and its output is the pixel values of the missing patches to be reconstructed.
Method
It differs from the classical autoencoder in its asymmetric design:
the encoder depends only on the partial observation (it needs no mask-token information, whereas BERT does),
and the lightweight decoder reconstructs the original signal from the resulting latent representation together with the mask tokens.

Pipeline
- Cut the image into patches and randomly pick a fraction of them (e.g., 25% in the paper) as the network input;
- Pass them through the encoder to obtain the corresponding encoded patches;
- Restore the encoded patches to their original positions and fill the missing positions with masked patches (mask tokens);
- Feed the full set into the decoder; the decoder predicts the image pixels of each patch;
- Compute the MSE between the predicted pixels and the pixels of the original image as the loss (the loss function is MSE; note: as in BERT, the loss is computed only on the masked patches — a minimal sketch of this loss follows the list);
- Use the trained encoder as the base model for downstream tasks and fine-tune it on them.
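A minimal sketch of the masked-patch MSE loss mentioned above; the function name and tensor layout are assumptions for illustration, not the paper's code:

```python
import torch

def mae_loss(pred, target, mask):
    """MSE computed only on the masked patches (BERT-style).

    pred, target: (batch, num_patches, patch_dim) pixel values per patch
    mask: (batch, num_patches), 1 for masked (removed) patches, 0 for visible ones
    """
    loss = (pred - target) ** 2              # squared error per pixel
    loss = loss.mean(dim=-1)                 # mean over the pixels of each patch
    return (loss * mask).sum() / mask.sum()  # average over masked patches only
```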

Encoder
The encoder uses the ViT architecture but operates only on the visible, unmasked patches.
The image patches are first encoded by a linear projection,
position embeddings are added,
and the tokens are then fed into a stack of Transformer blocks.
The final (decoder) output is reshaped to form the reconstructed image;
the reconstruction target is the pixel values of each masked patch, which is how MAE rebuilds the original information.
Because the encoder processes only a small fraction of the patches (e.g., 25%) and uses no mask-token information,
very large encoders can be trained with only a fraction of the usual compute and memory.
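A minimal sketch of such an encoder in PyTorch. The class name `MAEEncoder`, the dimensions, and the use of `nn.TransformerEncoder` as the block stack are illustrative assumptions; the paper's actual ViT implementation differs in detail (e.g., it includes a class token).

```python
import torch
import torch.nn as nn

class MAEEncoder(nn.Module):
    """ViT-style encoder that runs only on the visible (unmasked) patches."""

    def __init__(self, img_size=224, patch_size=16, dim=768, depth=12, heads=12):
        super().__init__()
        # Linear projection of patches, implemented as a strided convolution
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, imgs, visible_idx):
        # Patch embedding + position embedding for all patches
        x = self.patch_embed(imgs).flatten(2).transpose(1, 2)  # (B, N, dim)
        x = x + self.pos_embed
        # Keep only the visible patches before the Transformer blocks
        x = torch.gather(x, 1, visible_idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        return self.blocks(x)                                  # (B, N_visible, dim)
```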
Decoder
The decoder's input is the full set of image patch tokens:
(1) the output of the encoder (the visible patches);
(2) mask tokens.
Each mask token is the same shared, learnable vector that indicates the presence of a missing patch to be predicted.
Position embeddings are added to all tokens in this full set, so the mask tokens also carry information about their location in the image.
The decoder likewise consists of a series of Transformer blocks.
Its last layer is a linear projection whose number of output channels equals the number of pixel values in each patch.
The MAE decoder is used only during pre-training for image reconstruction; the encoder alone is used to produce the image representations used for recognition.
Because the decoder design is independent of the encoder design, it can be chosen with a high degree of flexibility (small and lightweight).
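A corresponding sketch of the lightweight decoder, assuming it receives the encoder output and an `ids_restore` inverse permutation produced by the masking step (sketched further below). The widths, depth, and attribute names are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MAEDecoder(nn.Module):
    """Mask tokens + unshuffle + Transformer blocks + linear projection to pixels."""

    def __init__(self, num_patches=196, enc_dim=768, dim=512, depth=8,
                 heads=16, patch_pixels=16 * 16 * 3):
        super().__init__()
        self.embed = nn.Linear(enc_dim, dim)                    # map encoder width to decoder width
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))  # one shared, learnable vector
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, patch_pixels)                # output channels = pixels per patch

    def forward(self, enc_tokens, ids_restore):
        x = self.embed(enc_tokens)                              # (B, N_visible, dim)
        B, n_visible, dim = x.shape
        n_masked = ids_restore.size(1) - n_visible
        mask_tokens = self.mask_token.expand(B, n_masked, dim)  # one copy per missing patch
        x = torch.cat([x, mask_tokens], dim=1)
        # Unshuffle: put every token back at its original patch position
        x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, dim))
        x = x + self.pos_embed                                  # position embedding on all tokens
        x = self.blocks(x)                                      # full sequence through Transformer blocks
        return self.head(x)                                     # (B, num_patches, patch_pixels)
```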
The reconstruction target is the normalized pixel values of each masked patch.
The mean and standard deviation of each patch are computed and used to normalize that patch; using the normalized pixels as the reconstruction target improves representation quality.
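A sketch of this per-patch normalization, assuming each patch has already been flattened into a vector of pixel values; the function name and epsilon are illustrative:

```python
import torch

def normalized_patch_target(patches, eps=1e-6):
    """Normalize each patch by its own mean and standard deviation.

    patches: (batch, num_patches, patch_dim) raw pixel values of each patch
    """
    mean = patches.mean(dim=-1, keepdim=True)
    var = patches.var(dim=-1, keepdim=True)
    return (patches - mean) / (var + eps) ** 0.5
```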
Simple implementation: MAE pre-training can be implemented very efficiently:
1. Generate a token for every input patch via linear projection, with a position embedding added;
2. Randomly shuffle the token sequence and, according to the masking ratio, remove the last portion of the tokens (sketched below);
3. Feed the remaining (unmasked) patch tokens into the encoder to obtain their representations;
4. Append mask tokens (shared, learnable vectors) to the encoder output and unshuffle (invert the random permutation) to recover the full-length sequence, so every token is aligned with its target; then feed this full sequence into the decoder, which operates on all tokens.
As described above, MAE requires no sparse operations. Moreover, shuffling and unshuffling are fast and introduce negligible computational overhead.
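A sketch of the shuffle/unshuffle masking in steps 2–4, assuming the tokens have already been projected and given position embeddings; the function name and return convention (`ids_restore` for the inverse permutation) are assumptions for illustration:

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Randomly keep a subset of patch tokens; return the indices needed to unshuffle.

    tokens: (batch, num_patches, dim) patch tokens after projection + position embedding
    """
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=tokens.device)   # random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation (shuffle)
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation (unshuffle)

    ids_keep = ids_shuffle[:, :n_keep]               # keep the first part, drop the rest
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, device=tokens.device)    # 1 = masked / removed patch
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)        # back to original patch order
    return visible, mask, ids_restore
```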
Downstream evaluation uses ImageNet for classification, COCO for detection, and ADE20K for segmentation.

Partial Fine-tuning
The paper also adopts a Partial Fine-tuning protocol, which differs from the commonly used Linear Probing (only the final linear classifier's parameters are trained) and full Fine-tuning (the parameters of all layers are trained).
Partial Fine-tuning means training only the parameters of the last few layers of the model.
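A sketch of partial fine-tuning on a ViT-style model; the attribute names `blocks` and `head` and the choice of four trainable blocks are assumptions for illustration, not the paper's protocol details.

```python
def partial_finetune(vit, n_trainable_blocks=4):
    """Freeze everything, then unfreeze only the last few Transformer blocks and the head."""
    for p in vit.parameters():
        p.requires_grad = False
    for blk in vit.blocks[-n_trainable_blocks:]:   # last few Transformer blocks
        for p in blk.parameters():
            p.requires_grad = True
    for p in vit.head.parameters():                # classification head
        p.requires_grad = True
```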
Reference link: