Segmenting objects across 10,000 video frames with less than 1.4GB of GPU memory | ECCV 2022


2022-07-23 13:23:00 Xixiaoyao


Written by Mingmin, from Aofei Temple
Source: QbitAI official account (量子位)

Why did a perfectly normal Chika Fujiwara suddenly turn into a "high-temperature red" version?


And this big purple hand: is Thanos back??


If you think these effects are just post-production coloring of the objects, then the AI really has fooled you.

These strange colors are, in fact, visualizations of video object segmentation.

But to be fair, for a moment the effect really is hard to tell apart from post-production coloring.

Whether it's a girl's flying hair:


or a towel that keeps changing shape while objects occlude one another:


the AI's segmentation of the target is close to perfect, as if the color had been "welded" on.

And it does more than high-precision segmentation: the method can handle videos of over 10,000 frames.

The segmentation quality stays at the same level throughout, remaining silky and fine even in the second half of a long video.


What's even more surprising is that the method places little demand on the GPU.

The researchers report that GPU memory consumption never exceeded 1.4GB throughout their experiments.

For context, existing attention-based methods of this kind cannot even process a video longer than one minute on an ordinary consumer graphics card.

This is XMem, a long-video object segmentation method recently proposed by scholars at the University of Illinois Urbana-Champaign.

It has already been accepted at ECCV 2022, and the code is open source.

Such silky results drew plenty of onlookers on Reddit, where the post's score climbed past 800.


Netizens joked:

Why paint your hands purple?
Who knows, maybe Thanos has a thing for computer vision?


Mimicking human memory

Existing video object segmentation methods tend to fall short somewhere: they are either slow, demanding on the GPU, or not accurate enough.

The method proposed in this paper accounts for all three.

It segments long videos quickly, running at up to 20FPS, and does so on an ordinary GPU.

What makes it special is that it takes inspiration from how human memory works.

In 1968, psychologists Atkinson and Shiffrin proposed the multi-store model of memory (the Atkinson-Shiffrin memory model).

The model divides human memory into three stores: sensory memory, short-term memory, and long-term memory.

Following this model, the researchers split their AI framework into three kinds of memory:

  • a sensory memory updated every frame

  • a high-resolution working memory

  • a compact long-term memory


The sensory memory is updated every frame, recording low-level image information from the current picture.

The working memory collects features from the sensory memory, updating once every r frames.

When the working memory saturates, it is compressed and consolidated into long-term memory.

When the long-term memory saturates, it forgets obsolete features over time; in practice it only fills up after thousands of frames have been processed.

As a result, GPU memory does not run out as the video goes on.
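As a rough illustration of this three-store policy (not the authors' code), the sketch below uses a hypothetical `ThreeStoreMemory` class: the capacities, the update interval `r`, and the naive "keep every k-th feature" compression are all made up for illustration, standing in for XMem's learned consolidation.

```python
class ThreeStoreMemory:
    """Toy three-store memory: sensory, working, and long-term."""

    def __init__(self, update_every_r=5, working_capacity=10,
                 longterm_capacity=50, compress_keep_every=2):
        self.r = update_every_r
        self.working_capacity = working_capacity
        self.longterm_capacity = longterm_capacity
        self.keep_every = compress_keep_every
        self.sensory = None   # overwritten on every frame
        self.working = []     # recent, high-resolution features
        self.longterm = []    # compressed, older features

    def add_frame(self, frame_idx, feature):
        # Sensory memory is refreshed on every single frame.
        self.sensory = feature
        # Working memory only takes in every r-th frame.
        if frame_idx % self.r == 0:
            self.working.append(feature)
            if len(self.working) > self.working_capacity:
                self._consolidate()

    def _consolidate(self):
        # When working memory saturates, move its older entries into
        # long-term memory, keeping only every k-th feature as a crude
        # stand-in for XMem's learned compression.
        older, self.working = self.working[:-1], self.working[-1:]
        self.longterm.extend(older[::self.keep_every])
        # When long-term memory saturates, forget the oldest features,
        # so total storage stays bounded regardless of video length.
        if len(self.longterm) > self.longterm_capacity:
            self.longterm = self.longterm[-self.longterm_capacity:]
```

With these toy numbers, even a 10,000-frame video never holds more than `working_capacity + longterm_capacity + 1` stored features at once, which is the basic reason memory use stays bounded.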

Typically, video object segmentation starts from the first frame's image and a mask of the target object; the model then tracks the target and generates masks for each subsequent frame.

Concretely, XMem processes a single frame as follows:


The whole AI framework consists of three end-to-end convolutional networks.

A query encoder (Query encoder) extracts the image features used to query the memory.

A decoder (Decoder) takes the output of the memory-reading step and generates the object mask.

A value encoder (Value encoder) combines the image with the target's mask to extract new memory feature values.

Finally, the feature values extracted by the value encoder are added to the working memory.
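The per-frame flow just described can be sketched roughly as below. This is a toy illustration, not the authors' implementation: `query_encoder`, `memory_read`, `decoder`, and `value_encoder` are hypothetical stand-ins that use simple list arithmetic instead of real CNNs, and the memory readout is a plain average rather than the paper's attention-weighted lookup.

```python
def query_encoder(image):
    # Query encoder: extract query features from the current frame
    # (here, just a scaled copy of the "pixels").
    return [pixel * 0.5 for pixel in image]

def memory_read(query, working_memory):
    # Memory readout: average the stored feature values (a crude
    # stand-in for attention over memory keys).
    if not working_memory:
        return query
    n = len(working_memory)
    return [sum(values) / n for values in zip(*working_memory)]

def decoder(readout):
    # Decoder: turn the readout features into a binary object mask.
    return [1 if v > 0.5 else 0 for v in readout]

def value_encoder(image, mask):
    # Value encoder: fuse the frame with its mask into a new memory value.
    return [pixel * m for pixel, m in zip(image, mask)]

def process_frame(image, working_memory):
    query = query_encoder(image)
    readout = memory_read(query, working_memory)
    mask = decoder(readout)
    # The new feature value joins the working memory for later frames.
    working_memory.append(value_encoder(image, mask))
    return mask

# The given first-frame mask seeds the memory, as described above.
memory = [value_encoder([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0])]
mask = process_frame([0.9, 0.1, 0.8, 0.2], memory)
```

Each call to `process_frame` both produces a mask for the current frame and deposits a new feature value into the working memory for future frames to read.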

Experimental results show the method achieves SOTA on both short and long videos.


When processing long videos, XMem's performance does not degrade as the number of frames grows.


Research team

One of the authors, Ho Kei (Rex) Cheng, is Chinese.


He graduated from the Hong Kong University of Science and Technology and is pursuing a doctorate at the University of Illinois Urbana-Champaign.

His research focus is computer vision.

Several of his papers have been accepted at top venues such as CVPR, NeurIPS, and ECCV.

The other author is Alexander G. Schwing.


He is now an assistant professor at the University of Illinois Urbana-Champaign and a graduate of ETH Zurich.

His research interests are machine learning and computer vision.

Paper:
https://arxiv.org/abs/2207.07115

GitHub:
https://github.com/hkchengrex/XMem


Copyright notice: this article was written by [Xixiaoyao]; please include a link to the original when reposting:
https://yzsam.com/2022/204/202207230603186053.html