Paper reading [Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset]

2022-06-23 08:27:00 hei_ hei_ hei_

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

  • Published: CVPR 2017
  • Main contributions: (1) It releases a large video dataset, Kinetics, which can be used for network training and transfer learning. (2) It proposes a new video action classification model, I3D.

Previous models

a. ConvNet+LSTM

First, a CNN extracts spatial features from each frame. The features are then fed in sequence to an LSTM, which extracts temporal features, and the last hidden state of the LSTM is used for action classification.
ps: In practice the results are not very good, so this approach is not popular.
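The pipeline above can be sketched in NumPy as follows. This is only a minimal illustration under stated assumptions: the per-frame CNN is replaced by random placeholder features, all weights are random rather than trained, and the sizes (8 frames, 16-dimensional features, 32 hidden units, 5 classes) are made up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_hidden(features, W, U, b):
    """Run a single-layer LSTM over (T, D) frame features and return the final hidden state."""
    hidden_size = U.shape[1]
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    for x in features:                      # one step per frame, in temporal order
        gates = W @ x + U @ h + b           # shape (4 * hidden_size,)
        i, f, g, o = np.split(gates, 4)     # input, forget, candidate, output
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g                   # update the cell state
        h = o * np.tanh(c)                  # update the hidden state
    return h

rng = np.random.default_rng(0)
T, D, H, num_classes = 8, 16, 32, 5         # illustrative sizes (assumptions)

# Placeholder for the CNN output: one D-dimensional feature vector per frame.
frame_features = rng.standard_normal((T, D))

W = rng.standard_normal((4 * H, D)) * 0.1   # input-to-hidden weights
U = rng.standard_normal((4 * H, H)) * 0.1   # hidden-to-hidden weights
b = np.zeros(4 * H)
W_cls = rng.standard_normal((num_classes, H)) * 0.1

h_last = lstm_last_hidden(frame_features, W, U, b)
logits = W_cls @ h_last                     # classify from the last hidden state
print(logits.shape)
```

In a real model the placeholder features would come from a pretrained image CNN and the LSTM and classifier weights would be learned jointly.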

b. 3D-ConvNet

The video clip is fed into the network as a whole, and 3D convolutions learn its spatiotemporal features directly. The 2D convolution and pooling layers are all replaced by their 3D counterparts.
ps: The number of parameters is huge, so the model is hard to train on small datasets, but the results are acceptable.
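To make the parameter growth concrete, the sketch below counts the weights of a single convolution layer in its 2D and 3D forms. The layer sizes (64 input channels, 128 output channels, 3×3 versus 3×3×3 kernels) are illustrative assumptions, not figures from the paper.

```python
def conv_params(in_channels, out_channels, kernel_shape, bias=True):
    """Number of learnable parameters in one convolution layer."""
    weights = in_channels * out_channels
    for k in kernel_shape:
        weights *= k
    return weights + (out_channels if bias else 0)

params_2d = conv_params(64, 128, (3, 3))       # kernel: height x width
params_3d = conv_params(64, 128, (3, 3, 3))    # kernel: time x height x width

print(params_2d)  # 73856
print(params_3d)  # 221312
```

With a temporal extent of 3, the weight count of this layer roughly triples, and the same factor applies to every inflated layer in the network.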

c. Two-Stream

Optical flow (the apparent motion of pixels between consecutive frames, i.e. how targets move through the video) is used to model temporal features. The spatial-stream network takes one or more RGB frames as input and learns the scene information in the images; the temporal-stream network takes the video's optical flow maps as input and learns the motion of objects.
ps: The model is relatively simple and easy to train: it only needs to extract the optical flow maps of the video and learn the mapping from them to action classes. It is widely used.
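The two streams are combined at prediction time by a weighted average of their class scores. A minimal sketch of that late fusion is shown below; the logits and the equal weighting are made-up values for illustration, and `fuse_two_stream` is a hypothetical helper name.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_two_stream(rgb_logits, flow_logits, flow_weight=0.5):
    """Late fusion: weighted average of the per-class probabilities of both streams."""
    rgb_probs = softmax(rgb_logits)
    flow_probs = softmax(flow_logits)
    return [(1.0 - flow_weight) * r + flow_weight * f
            for r, f in zip(rgb_probs, flow_probs)]

rgb_logits = [2.0, 1.0, 0.1]     # spatial stream prefers class 0 (made-up scores)
flow_logits = [0.5, 2.5, 0.3]    # temporal stream prefers class 1 (made-up scores)

fused = fuse_two_stream(rgb_logits, flow_logits)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
print(predicted_class)  # 1
```

Here the temporal stream is more confident than the spatial one, so the averaged prediction follows it.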

d. 3D-Fused Two-Stream

A combination of b and c: the weighted average that fuses the two streams in c is replaced by a 3D ConvNet.
Summary: With sufficient data, 3D convolutions perform much better than 2D convolutions, but there are still things they cannot learn well on their own (additional information such as optical flow maps may be needed as a supplement).

Model framework


(1) Inflating

A 2D network is "inflated" into a 3D one while the architecture is kept unchanged: every 2D convolution is replaced by a 3D convolution and every 2D pooling layer by a 3D pooling layer. This makes it possible to reuse existing 2D network designs directly.
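At the level of the architecture description, inflation can be sketched as a mapping over layer specs. The dict-based layer format and its keys below are hypothetical, and making every square N×N kernel cubic (N×N×N) is the simplest choice assumed here; a real network may use different temporal kernel sizes or strides.

```python
# Hypothetical layer specs; the dict keys are made up for this sketch.
INFLATED_TYPE = {"conv2d": "conv3d", "maxpool2d": "maxpool3d"}

def inflate_layer(layer):
    """Return the 3D counterpart of a 2D layer spec with a square kernel."""
    k_h, k_w = layer["kernel"]
    assert k_h == k_w, "this sketch only handles square kernels"
    inflated = dict(layer)                    # keep every other field unchanged
    inflated["type"] = INFLATED_TYPE[layer["type"]]
    inflated["kernel"] = (k_h, k_h, k_h)      # N x N becomes N x N x N
    return inflated

net_2d = [
    {"type": "conv2d", "kernel": (7, 7), "out_channels": 64},
    {"type": "maxpool2d", "kernel": (3, 3)},
    {"type": "conv2d", "kernel": (3, 3), "out_channels": 192},
]
net_3d = [inflate_layer(layer) for layer in net_2d]

for layer in net_3d:
    print(layer["type"], layer["kernel"])
```

The layer order, channel counts, and everything else about the design carry over untouched; only the dimensionality of each operator changes.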

(2) Bootstrapping

This step addresses how to initialize the 3D model with the parameters of a well-trained 2D model. The basic idea is that, for the same input, the two models should produce the same output. Specifically, an image is copied N times to form a video, the 2D parameters are copied N times along the time dimension, and the parameters are then divided by N (rescaling, which keeps the outputs of the two models consistent).
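The equivalence can be checked numerically. The sketch below uses naive single-channel convolutions written in NumPy (not any framework's implementation), a random 3×3 filter, and a 5-frame video made of identical copies of one image; all sizes are arbitrary choices for the demonstration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive single-channel 2D cross-correlation, no padding, stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def conv3d_valid(video, kernel):
    """Naive single-channel 3D cross-correlation over (time, height, width)."""
    kt, kh, kw = kernel.shape
    out_t = video.shape[0] - kt + 1
    out_h = video.shape[1] - kh + 1
    out_w = video.shape[2] - kw + 1
    out = np.zeros((out_t, out_h, out_w))
    for t in range(out_t):
        for i in range(out_h):
            for j in range(out_w):
                out[t, i, j] = np.sum(video[t:t + kt, i:i + kh, j:j + kw] * kernel)
    return out

def inflate_kernel(kernel_2d, n):
    """Repeat a 2D kernel n times along a new time axis and rescale by 1/n."""
    return np.repeat(kernel_2d[np.newaxis, :, :], n, axis=0) / n

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6))
kernel_2d = rng.standard_normal((3, 3))
n = 3

# A video made of 5 identical copies of the image.
static_video = np.repeat(image[np.newaxis, :, :], 5, axis=0)
kernel_3d = inflate_kernel(kernel_2d, n)

out_2d = conv2d_valid(image, kernel_2d)
out_3d = conv3d_valid(static_video, kernel_3d)

# Every time step of the 3D response equals the 2D response.
matches = all(np.allclose(out_3d[t], out_2d) for t in range(out_3d.shape[0]))
print(matches)  # True
```

Without the division by N, the 3D response on the static video would be N times larger than the 2D one, so the activations seen by later pretrained layers would no longer match.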

(3) Model details

ps: These days, however, ResNet is what is mostly used.

Copyright notice
This article was written by [hei_ hei_ hei_]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/174/202206230805275341.html