NIPS 2014 | Two-Stream Convolutional Networks for Action Recognition in Videos: reading notes
2022-06-25 08:26:00 【ybacm】
Two-Stream Convolutional Networks for Action Recognition in Videos
Affiliation: Visual Geometry Group, University of Oxford
Authors: Karen Simonyan, Andrew Zisserman
Conference: NIPS 2014
Paper address: https://proceedings.neurips.cc/paper/2014/hash/00ec53c4682d36f5c4359f4ae7bd7ba1-Abstract.html
The notes below follow the paper-reading video by Mu Li's team (https://www.bilibili.com/video/BV1mq4y1x7RU?spm_id_from=333.1007.top_right_bar_window_history.content.click&vd_source=9a9f9a00848a88972d0fcfd341e9e738).
Why does the title say action recognition when the task is essentially classification? Because videos related to human actions are the most common, so in terms of both practical value and dataset collection, action recognition in videos is the more worthwhile task.
Motivation: the authors observe that a convolutional network is good at learning local appearance information in an image but cannot learn the motion information between frames. They therefore propose the two-stream network: a spatial-stream ConvNet (taking a single frame as input, to learn spatial information) and a temporal-stream ConvNet (taking a stack of optical-flow images as input, to learn motion information, which is equivalent to handing the motion features to the network directly), so that the model learns motion information explicitly.
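To make the idea concrete, here is a minimal PyTorch sketch of a two-stream model. The ResNet-18 backbone is an assumption for brevity (the paper uses a CNN-M-2048-style architecture), and late fusion by averaging softmax scores is used; all names and dimensions are illustrative.

```python
# Minimal two-stream sketch: the spatial stream takes an RGB frame, the temporal
# stream takes a stack of 2L flow channels; the two softmax outputs are averaged.
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes=101, flow_length=10):
        super().__init__()
        self.spatial = models.resnet18(weights=None, num_classes=num_classes)
        self.temporal = models.resnet18(weights=None, num_classes=num_classes)
        # The temporal stream sees 2L flow channels instead of 3 RGB channels
        self.temporal.conv1 = nn.Conv2d(2 * flow_length, 64, kernel_size=7,
                                        stride=2, padding=3, bias=False)

    def forward(self, rgb, flow_stack):
        p_spatial = self.spatial(rgb).softmax(dim=1)
        p_temporal = self.temporal(flow_stack).softmax(dim=1)
        return (p_spatial + p_temporal) / 2   # late fusion: average the class scores

model = TwoStreamNet()
scores = model(torch.randn(1, 3, 224, 224), torch.randn(1, 20, 224, 224))
print(scores.shape)  # torch.Size([1, 101])
```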
Optical flow: it describes the motion between frames effectively. By extracting optical flow, the background and a person's clothing can largely be ignored, so the network can focus on the action itself. In the optical-flow visualisation, the larger an object's motion, the brighter its colour.
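For reference, a common way to produce such visualisations is the standard OpenCV HSV mapping, where hue encodes motion direction and brightness encodes magnitude. This is a generic sketch with a placeholder flow field, not the paper's own visualisation code.

```python
# Visualise a flow field as an HSV image: hue = motion direction, value = motion
# magnitude, which is why pixels that move more appear brighter.
import cv2
import numpy as np

flow = np.random.randn(224, 224, 2).astype(np.float32)           # placeholder flow field
mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
hsv = np.zeros((224, 224, 3), dtype=np.uint8)
hsv[..., 0] = (ang * 180 / np.pi / 2).astype(np.uint8)           # direction -> hue
hsv[..., 1] = 255                                                # full saturation
hsv[..., 2] = cv2.normalize(mag, None, 0, 255,
                            cv2.NORM_MINMAX).astype(np.uint8)    # magnitude -> brightness
vis = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```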
Another reason for using optical-flow images is that the best hand-crafted features at the time were also based on optical-flow trajectories.
Abstract
The contribution has three parts: first, a two-stream network is proposed; second, it is shown that a ConvNet trained on multi-frame dense optical flow can achieve very good performance despite limited training data; third, multi-task training is used, i.e. training on two datasets simultaneously, which further improves accuracy.
1 Introduction
Additionally, video provides natural data augmentation (jittering) for single image (video frame) classification.
Because objects in a video naturally deform, move, and change in appearance from frame to frame, this is a much better form of augmentation than hand-crafted jittering.
Late fusion means merging at the level of the final logits; early fusion means merging the outputs of intermediate layers of the network.
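A toy contrast of the two fusion points, using placeholder tensors (the shapes and the 101-class setting are illustrative assumptions):

```python
# Late fusion: combine the two networks' class scores; early fusion: combine
# intermediate feature maps and let further layers process the merged features.
import torch

spatial_scores = torch.randn(1, 101)     # placeholder class scores, spatial stream
temporal_scores = torch.randn(1, 101)    # placeholder class scores, temporal stream
late_fused = (spatial_scores.softmax(1) + temporal_scores.softmax(1)) / 2

spatial_feat = torch.randn(1, 512, 7, 7)     # placeholder mid-level feature maps
temporal_feat = torch.randn(1, 512, 7, 7)
early_fused = torch.cat([spatial_feat, temporal_feat], dim=1)  # fed to later layers
```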
The two-stream approach has two benefits: first, the spatial stream can be initialised from a model pre-trained on ImageNet; second, the temporal stream is trained only on optical-flow information, which makes its learning problem easier.
1.1 Related work
The paper [14] found that feeding multiple frames into the network performs essentially the same as single-frame input, even on a dataset as large as Sports-1M.
2 Two-stream architecture for video recognition
The overall framework is shown in Figure 1. The authors consider two fusion methods: the first is to directly average the softmax scores of the two streams; the second is to use the stacked L2-normalised softmax scores as features and train a multi-class linear SVM [6] on them.
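A sketch of the second fusion option, assuming scikit-learn's LinearSVC and random placeholder scores (the paper does not specify the SVM implementation):

```python
# Stack the L2-normalised softmax scores of both streams into one feature vector per
# video, then train a multi-class linear SVM on these stacked features.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
num_videos, num_classes = 200, 101
spatial_scores = rng.random((num_videos, num_classes))    # placeholder softmax scores
temporal_scores = rng.random((num_videos, num_classes))
labels = rng.integers(0, num_classes, num_videos)

def stack_l2(s, t):
    s = s / np.linalg.norm(s, axis=1, keepdims=True)      # L2-normalise each stream
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    return np.concatenate([s, t], axis=1)                 # (N, 2 * num_classes)

svm = LinearSVC(C=1.0).fit(stack_l2(spatial_scores, temporal_scores), labels)
```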
Spatial stream ConvNet. Although the spatial-stream network is very simple, it is still important, because the appearance of objects in the video (shape, size, colour, etc.) carries a lot of information: recognising a basketball in a frame already suggests that the video probably shows someone playing basketball.
3 Optical flow ConvNets
The optical flow is computed from two consecutive frames, so a video with L frames yields L-1 flow fields. If the input is two H×W×3 frames, the resulting flow field is H×W×2, where the two channels are the horizontal and vertical flow components, as shown in Figure 2(d) and (e).
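A sketch of extracting consecutive-frame flow with OpenCV's Farneback method; the paper itself uses a different (GPU-based) flow algorithm, Farneback is just a convenient stand-in, and the video path is hypothetical.

```python
# For a video with L frames this loop produces L-1 flow fields, each of shape
# (H, W, 2): horizontal and vertical displacement per pixel.
import cv2

cap = cv2.VideoCapture("video.mp4")                 # hypothetical input video
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

flows = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    flows.append(flow)                              # (H, W, 2)
    prev_gray = gray
cap.release()
```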
3.1 ConvNet input configurations
How should this optical-flow information be fed to the network? As shown in Figure 3, the authors propose two ways: the first is to directly stack the flow fields at the same pixel positions; the second is to read the displaced position from each flow field and stack the flow values along that trajectory. Although the second way seems more reasonable, it actually performs worse than the first.
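A NumPy sketch contrasting the two configurations with placeholder flow fields; the trajectory sampling here is my own simplified nearest-neighbour version, not the paper's exact formulation.

```python
import numpy as np

H, W, L = 224, 224, 10
flows = [np.random.randn(H, W, 2).astype(np.float32) for _ in range(L)]  # placeholder flows

# (1) Optical flow stacking: every flow field is sampled at the same pixel position.
stacked = np.concatenate(flows, axis=2)            # (H, W, 2L)

# (2) Trajectory stacking: follow the displacement from one flow field to the next.
ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)
trajectory = np.zeros((H, W, 2 * L), dtype=np.float32)
px, py = xs.copy(), ys.copy()
for t, flow in enumerate(flows):
    xi = np.clip(np.round(px).astype(int), 0, W - 1)
    yi = np.clip(np.round(py).astype(int), 0, H - 1)
    sampled = flow[yi, xi]                         # flow sampled along the trajectory
    trajectory[..., 2 * t:2 * t + 2] = sampled
    px += sampled[..., 0]                          # move the sampling point with the flow
    py += sampled[..., 1]
```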
Bi-directional optical flow. The authors also use bi-directional optical flow: the first half of the flow fields are computed forward (frame i to frame i+1) and the second half backward (frame i+1 to frame i), which keeps the input dimensions unchanged.
The temporal-stream network has the same overall architecture as the spatial stream; only the input dimensions differ: the spatial stream takes H×W×3, while the temporal stream takes H×W×2L (the dimension after stacking the flow fields).
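A tiny shape check of the inputs just described, with the bi-directional split assumed to be half forward and half backward flow fields:

```python
import numpy as np

H, W, L = 224, 224, 10
rgb_frame = np.zeros((H, W, 3), dtype=np.float32)                          # spatial stream input
forward = [np.zeros((H, W, 2), dtype=np.float32) for _ in range(L // 2)]   # frames i -> i+1
backward = [np.zeros((H, W, 2), dtype=np.float32) for _ in range(L // 2)]  # frames i+1 -> i
temporal_input = np.concatenate(forward + backward, axis=2)                # (H, W, 2L)
print(rgb_frame.shape, temporal_input.shape)   # (224, 224, 3) (224, 224, 20)
```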
5 Implementation details
Points worth noting:
Testing. At test time, the authors sample a fixed number of frames (25 here) at equal temporal intervals from each video; for a 100-frame video, for example, one frame is taken every 4 frames, giving 25 frames. Each of these 25 frames is then 10-cropped (the four corners and the centre are cropped, the frame is flipped horizontally and the same five crops are taken again, giving 10 crops), which yields 250 views in total. These 250 views are fed into the spatial-stream network (the temporal stream is handled similarly), and the resulting class scores are averaged.
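A rough sketch of this test-time protocol; the frame and crop counts follow the description above, while the sampling and cropping code itself is my own illustrative version.

```python
import numpy as np

def sample_frame_indices(num_frames, num_samples=25):
    # 25 frames at equal temporal spacing across the video
    return np.linspace(0, num_frames - 1, num_samples).astype(int)

def ten_crop(frame, size=224):
    # Four corners + centre, plus the same five crops of the horizontally flipped frame
    H, W, _ = frame.shape
    ys = [0, 0, H - size, H - size, (H - size) // 2]
    xs = [0, W - size, 0, W - size, (W - size) // 2]
    crops = [frame[y:y + size, x:x + size] for y, x in zip(ys, xs)]
    flipped = frame[:, ::-1]
    crops += [flipped[y:y + size, x:x + size] for y, x in zip(ys, xs)]
    return crops

frames = np.zeros((100, 256, 340, 3), dtype=np.float32)   # hypothetical decoded video
views = [c for i in sample_frame_indices(len(frames)) for c in ten_crop(frames[i])]
print(len(views))  # 25 frames x 10 crops = 250 views, whose class scores are averaged
```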
Optical flow. To avoid storing the dense flow fields as floating-point data, the authors rescale the flow values to the range [0, 255] and compress them as JPEG images, which saves a large amount of storage space.
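A minimal sketch of such storage, assuming a symmetric clipping bound before rescaling; the bound, file names, and decode step are my own choices rather than the paper's exact settings.

```python
import cv2
import numpy as np

flow = (np.random.randn(224, 224, 2) * 5).astype(np.float32)   # placeholder flow field
bound = 20.0                                                   # assumed clipping bound

# Rescale the flow linearly to [0, 255] and store each component as a JPEG image
quantised = ((np.clip(flow, -bound, bound) + bound) / (2 * bound) * 255).astype(np.uint8)
cv2.imwrite("flow_x.jpg", quantised[..., 0])
cv2.imwrite("flow_y.jpg", quantised[..., 1])

# At training time, decode and map the values back to the original range
decoded = cv2.imread("flow_x.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)
restored_x = decoded / 255.0 * (2 * bound) - bound
```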
6 Evaluation
Using the temporal-stream network alone already achieves good results, which shows how important motion information is for video understanding!
7 Conclusions and directions for improvement
The paper ends with a summary and directions for future work. The authors mention that they do not understand why trajectory-based stacking of optical flow is not better; this question was later addressed by a CVPR 2015 paper.
The contribution of this paper is not just adding a temporal-stream network. The broader lesson is that when a neural network cannot solve part of a problem and tweaking the architecture brings little improvement, feeding it an additional input modality that carries the missing information can be a good strategy. In this sense, the two-stream network can also be seen as an early example of multimodal learning.