Super-Resolution: RLSP
2022-06-21 17:49:00 【Ton10】

This paper is from ICCVW 2019. It prioritizes speed for real-time video super-resolution, at some cost in expressiveness. The author proposes an efficient VSR model, Recurrent Latent Space Propagation (RLSP). It is a typical alignment-free method, so it is comparatively efficient next to classic VSR models based on optical flow or DCN. RLSP models VSR as an RNN; its two core components are Shuffling and the hidden state.
References:
① Source code
② Video super-resolution: RLSP (Efficient Video Super-Resolution through Recurrent Latent Space Propagation)
Efficient Video Super-Resolution through Recurrent Latent Space Propagation
Abstract
RLSP is an alignment-free VSR method. The biggest benefit of dropping alignment is efficiency and speed; the drawback is that its PSNR is worse than that of aligned VSR models.
The author gives three reasons why RLSP is needed:
- Real-time performance. Motion compensation, whether explicit or implicit, consumes compute and GPU memory, so removing the alignment module speeds up VSR reconstruction and saves GPU memory.
- Flow-based alignment depends heavily on the accuracy of motion estimation: inaccurate motion estimates introduce artifacts. Moreover, both flow-based and flow-free alignment rely on interpolation, which inevitably loses high-frequency detail.
- When the motion in a dataset is small, adjacent frames are already very close, so skipping alignment has little effect.
The author therefore proposes an RNN-based VSR model, RLSP, which treats video super-resolution as a sequence-to-sequence problem. Its main characteristics:
- No alignment, fast, real-time: nearly 70x faster than DUF.
- The core of RLSP is a high-dimensional ($C=128$) hidden state $h$ that propagates feature information from the past, combined with the PixelShuffle operation proposed in ESPCN, used as shuffle_up for upsampling and shuffle_down for the feedback path.
1. Introduction
Complex motion compensation requires expensive computation, so such models are unsuitable for real-time scenarios such as game super-resolution.
RLSP is a VSR model designed for real-time use. Unlike VESPCN, TDAN, Robust-LTD, and EDVR, which pass features through sliding windows, RLSP is built on a recurrent network structure and belongs to the unidirectional feature-propagation family of VSR methods; propagation is carried out through a latent (hidden) state.
The figure below shows the PSNR-runtime results for RLSP, FRVSR, and DUF:
- RLSP is roughly 10x faster than FRVSR and roughly 70x faster than DUF.
- "7-128" means 7 CNN layers after fusion with 128 filters per layer, so RLSP's performance can be scaled up by increasing the network's capacity.
2. Related Work
(Omitted.)
3. Method
At each iteration, the goal of RLSP is to super-resolve the current frame $x_t \in \mathbb{R}^{H\times W\times C}$ to $y_t \in \mathbb{R}^{rH\times rW\times C}$. The RLSP pipeline is shown below; since RLSP is based on an RNN structure, the most important part is the purple box, the RLSP cell:
Let us walk through the pipeline (assume each batch contains 10 frames, the input is a 64x64 RGB image, the scale factor is $r=4$, and the number of filters is $f=128$):
- RLSP borrows from sliding-window methods: it concatenates the current frame with its immediate predecessor and successor along the channel dimension, but drops the alignment step entirely, relying on the assumption that adjacent frames are highly similar. The first cell input is therefore $\mathbb{R}^{b\times 3\times 3\times 64\times 64}$ (3 frames, 3 RGB channels each).
- The second cell input is the previous super-resolved output $y_{t-1} \in \mathbb{R}^{b\times 3\times 256\times 256}$ after shuffle_down, giving $\mathbb{R}^{b\times (3\cdot 4\cdot 4)\times 64\times 64}$.
- The third cell input is the previous hidden state $h_{t-1} \in \mathbb{R}^{b\times 128\times 64\times 64}$. As in a standard RNN, the hidden state is predicted by learned layers (here convolutions). Since the process is recurrent, we analyze directly how $h_t$ is produced. As shown in the purple box, the cell has $n=7$ layers. The first layer is a $3\times 3$ convolution that fuses the channel-concatenated inputs, mapping $3\cdot 3 + f + 3r^2$ channels to 128. The next 5 layers are $128 \to 128$ convolutions with $3\times 3$ kernels. The final layer maps $128 \to 3r^2 + f$ channels, and its output is split along the channel dimension into two parts: one part passes through a ReLU to give $h_t \in \mathbb{R}^{b\times 128\times 64\times 64}$; the other is added to $x_t^*$ as a residual connection, giving $\mathbb{R}^{b\times (3r^2)\times 64\times 64}$, where $x_t^*$ is $x_t$ replicated $r^2$ times along the channel dimension, also of shape $\mathbb{R}^{b\times (3r^2)\times 64\times 64}$.
- shuffle_up is equivalent in spirit to PixelShuffle, and shuffle_down is the inverse of shuffle_up, similar to the pixel-unshuffle step seen in DCN-based alignment. The feedback path uses shuffle_down for downsampling, while shuffle_up performs the upsampling from $x_t$ to $y_t$.
Notes (a minimal PyTorch sketch of the cell follows):
- The residual connection lets the network learn only the residual, which makes training more stable; adding $x_t$ directly also compensates for the information the CNN inevitably loses.
- RLSP super-resolves only one frame per step.
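Below is a minimal PyTorch sketch of the RLSP cell under the assumptions above ($r=4$, $f=128$, $n=7$); the class and argument names are mine, not from the official code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RLSPCell(nn.Module):
    """Sketch of the RLSP cell: 7 conv layers, residual output + new hidden state."""
    def __init__(self, factor=4, filters=128, layers=7):
        super().__init__()
        self.factor = factor
        in_ch = 3 * 3 + filters + 3 * factor**2   # 3 RGB frames + h_{t-1} + shuffled-down y_{t-1}
        out_ch = 3 * factor**2 + filters          # residual part + hidden-state part
        body = [nn.Conv2d(in_ch, filters, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(layers - 2):
            body += [nn.Conv2d(filters, filters, 3, padding=1), nn.ReLU(inplace=True)]
        body.append(nn.Conv2d(filters, out_ch, 3, padding=1))
        self.body = nn.Sequential(*body)

    def forward(self, frames, h_prev, y_prev_down):
        # frames: (B, 9, H, W) = [x_{t-1}, x_t, x_{t+1}] concatenated on channels
        # h_prev: (B, f, H, W); y_prev_down: (B, 3*r^2, H, W)
        r2 = self.factor ** 2
        out = self.body(torch.cat([frames, h_prev, y_prev_down], dim=1))
        res, h_t = out[:, :3 * r2], F.relu(out[:, 3 * r2:])
        x_t_star = frames[:, 3:6].repeat(1, r2, 1, 1)  # x_t replicated r^2 times
        y_t_down = res + x_t_star                      # residual connection
        return y_t_down, h_t                           # shuffle_up(y_t_down) gives y_t
```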
3.1 Shuffling
Shuffling consists of shuffle_up for upsampling and shuffle_down for downsampling.
shuffle_up follows the sub-pixel convolution layer of ESPCN: it does not interpolate pixel values; it rearranges the pixels stored across channels into the spatial dimensions:
Shuffle-up:
$$t^{LR} \in \mathbb{R}^{H\times W\times Z} \;\xrightarrow{\times r}\; t^{HR} \in \mathbb{R}^{rH\times rW\times Z/r^2}. \tag{1}$$
Source code:
```python
def shuffle_up(x, factor):
    # x format: (B, C, H, W); moves factor^2 channel groups into space
    b, c, h, w = x.shape
    assert c % factor**2 == 0, "C must be a multiple of " + str(factor**2) + "!"
    # split channels into (factor, factor, C / factor^2)
    n = x.reshape(b, factor, factor, c // factor**2, h, w)
    # interleave the two factor axes with H and W
    n = n.permute(0, 3, 4, 1, 5, 2)
    # merge into (B, C / factor^2, factor*H, factor*W)
    n = n.reshape(b, c // factor**2, factor * h, factor * w)
    return n
```
Shuffle-down:
$$t^{HR} \in \mathbb{R}^{H\times W\times Z} \;\xrightarrow{\times r}\; t^{LR} \in \mathbb{R}^{H/r\times W/r\times r^2 Z}. \tag{2}$$
Source code:
```python
def shuffle_down(x, factor):
    # x format: (B, C, H, W); inverse of shuffle_up: folds space into channels
    b, c, h, w = x.shape
    assert h % factor == 0 and w % factor == 0, "H and W must be a multiple of " + str(factor) + "!"
    # split H and W into (H / factor, factor) and (W / factor, factor)
    n = x.reshape(b, c, h // factor, factor, w // factor, factor)
    # move the two factor axes in front of the channel axis
    n = n.permute(0, 3, 5, 1, 2, 4)
    # merge into (B, factor^2 * C, H / factor, W / factor)
    n = n.reshape(b, c * factor**2, h // factor, w // factor)
    return n
```
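A quick sanity check (my own test, not from the paper): shuffle_down exactly inverts shuffle_up, and for a single channel group ($C = r^2$) shuffle_up coincides with nn.PixelShuffle (with multiple groups the channel ordering differs, although the operation is equivalent in spirit):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 48, 64, 64)              # (B, 3*r^2, H, W) with r = 4
up = shuffle_up(x, 4)                       # -> (2, 3, 256, 256)
assert up.shape == (2, 3, 256, 256)
assert torch.equal(shuffle_down(up, 4), x)  # exact round trip

g = torch.randn(1, 16, 8, 8)                # single channel group: C = r^2
assert torch.equal(shuffle_up(g, 4), nn.PixelShuffle(4)(g))
```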
3.2 Residual Learning
In the cell, $x_t^*$ is added to the CNN output so that the network learns only the residual. Besides mitigating the vanishing-gradient problem, the residual connection adds stability: the residual has a narrower range to learn, which reduces variance. In addition, since the CNN inevitably attenuates the input information, adding the input directly also helps preserve the original input.
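A side observation (my own illustration, assuming the channel-tiled replication used in the cell sketch above): after shuffle_up, the replicated copies of $x_t$ fill every position of each $r\times r$ output block, so the skip path is exactly nearest-neighbor upsampling of $x_t$, and the network learns the residual on top of that:

```python
import torch
import torch.nn.functional as F

r = 4
x_t = torch.randn(1, 3, 64, 64)
skip = shuffle_up(x_t.repeat(1, r**2, 1, 1), r)             # (1, 3, 256, 256)
nn_up = F.interpolate(x_t, scale_factor=r, mode="nearest")  # nearest-neighbor upsampling
assert torch.equal(skip, nn_up)
```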
3.3 Feedback
Feedback applies shuffle_down to the previous output $y_{t-1}$. Because adjacent frames are highly correlated, fusing this information also helps the super-resolution of the current frame $x_t$.
3.4 Hidden State

As in an RNN, the hidden state $h_{t-1}$ remembers feature information from the past and is combined with the current frame so that past features assist the current frame's super-resolution. In RLSP, the author uses the 7-layer cell to learn the hidden state; its final shape is $\mathbb{R}^{b\times f\times 64\times 64}$ with $f=128$.
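A minimal recurrent rollout over a clip, under my own padding and initialization assumptions (zero-initialized hidden state and feedback, edge frames replicated at the clip boundaries):

```python
import torch

def rlsp_forward(cell, lr_frames, factor=4, filters=128):
    # lr_frames: (B, T, 3, H, W); returns (B, T, 3, factor*H, factor*W)
    b, t, _, h, w = lr_frames.shape
    h_state = lr_frames.new_zeros(b, filters, h, w)       # h_0 = 0
    y_down = lr_frames.new_zeros(b, 3 * factor**2, h, w)  # feedback for t = 0
    outputs = []
    for i in range(t):
        prev_f = lr_frames[:, max(i - 1, 0)]              # replicate at boundaries
        next_f = lr_frames[:, min(i + 1, t - 1)]
        frames = torch.cat([prev_f, lr_frames[:, i], next_f], dim=1)
        # y_down is already shuffle_down(y_{t-1}), so it is fed back directly
        y_down, h_state = cell(frames, h_state, y_down)
        outputs.append(shuffle_up(y_down, factor))        # (B, 3, rH, rW)
    return torch.stack(outputs, dim=1)
```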
3.5 Loss
RLSP uses an MSE loss, where $y^*$ is the ground truth and $k$ the number of elements:
$$\mathcal{L} = \frac{1}{k}\,\lVert y^* - y \rVert_2^2. \tag{3}$$
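In PyTorch this is simply mean-reduced MSE (a sketch; the tensors here are stand-in values):

```python
import torch
import torch.nn.functional as F

sr = torch.randn(2, 10, 3, 256, 256)  # predicted HR sequence
hr = torch.randn(2, 10, 3, 256, 256)  # ground-truth HR sequence
loss = F.mse_loss(sr, hr)             # mean reduction divides by the element count k
```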
4. Experimental Setup
For my reproduction, the experimental configuration is as follows:
```python
import torch

params = {
    "lr": 10 ** -4,
    "bs": 2,
    "crop size h": 64,
    "crop size w": 64,
    "sequence length": 5,
    "validation sequence length": 20,
    "number of workers": 8,
    "layers": 7,
    "kernel size": 3,
    "filters": 128,
    "state dimension": 128,
    "factor": 4,
    "save interval": 50000,
    "validation interval": 1000,
    "dataset root": "./dataset/",
    "device": torch.device("cuda" if torch.cuda.is_available() else "cpu"),
}
```
Because the source code does not state which dataset to use, and its data-reading code has a problem, I made two changes:
- Use the REDS dataset, placed under the configured "dataset root" (./dataset/).
- Use PIL.Image.open() to read the images.
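A sketch of the frame loader I used (the function name is mine, and normalization to [0, 1] is my own choice):

```python
from PIL import Image
import numpy as np
import torch

def load_frame(path):
    # read an RGB frame and convert to a float tensor in [0, 1], shape (3, H, W)
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
    return torch.from_numpy(img).permute(2, 0, 1)
```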
5. Results and Discussion
5.1 Ablation
In addition to residual connections, RLSP relies on three other tricks:
- Adding adjacent frames.
- Feedback.
- Hidden-state.
To study the effect of these three components on RLSP, the ablation is set up as follows: the first configuration processes every frame independently; the second adds adjacent frames; the third adds feedback ($y_{t-1}$); the fourth adds both feedback and the hidden state $h_{t-1}$. The results:
- All three components improve RLSP's performance, but each also increases the computational cost.
5.2 Temporal Consistency
(Omitted.)
5.3 Information Flow over Time
(Omitted.)
5.4 Initialization
(Omitted.)
5.5 Accuracy and Runtimes
- The experiments are validated on Vid4, tested on the full sequences.
- The reported PSNR is the average over the 4 Vid4 sequences; the reported runtime is the per-frame reconstruction time (ms).
- The target of restoration is 2K video sequences.
The experimental results are as follows:
- RLSP-7-128 takes 38 ms per frame, i.e., about 1000/38 ≈ 26 frames per second, so RLSP-7-128 meets the 25 fps real-time requirement.
- RLSP-7-128 has lower PSNR on the first frames because it is a unidirectional propagation model: it can only use information from the past, so little information is available at the start and more accumulates later. As Figure 8 shows, the PSNR of the first few frames is naturally lower and rises afterwards. This unevenness in information utilization can be addressed by adding a backward branch.
- RLSP's expressiveness can be increased by adding more filters to the cell.
The visualization results are as follows :
6. Conclusion
- This paper proposes an alignment-free VSR model, RLSP. It models VSR as a sequence-to-sequence problem and builds an RNN structure to perform video super-resolution.
- RLSP combines four techniques to raise PSNR: ① shuffling; ② residual learning; ③ feedback; ④ the hidden state.
- RLSP's greatest strength is its speed at a given PSNR; the 7-128 model just meets real-time requirements. Its expressiveness can be strengthened by increasing the cell's nonlinearity (more depth or width).