
Theoretical analysis of adversarial training: fast adversarial training with adaptive step size

2022-06-24 23:07:00 PaperWeekly


PaperWeekly original · Author | Guiguzi


Introduction

This post is a theoretical analysis of adversarial training. Adversarial training and its variants are currently the most effective defense against adversarial attacks, but the training process is extremely slow, which makes it hard to scale to large datasets such as ImageNet; in addition, fast adversarial training often suffers from catastrophic overfitting. In this paper, the authors study this phenomenon from the perspective of the training samples and show that catastrophic overfitting depends on the samples themselves: samples with larger gradient norms are more likely to trigger it. They therefore propose a simple but effective method, Adversarial Training with Adaptive Step size (ATAS).

ATAS learns a per-sample step size that is inversely proportional to the sample's gradient norm. The theoretical analysis shows that ATAS converges faster than the commonly used non-adaptive counterparts. Evaluated against a variety of adversarial perturbations, ATAS consistently mitigates catastrophic overfitting and achieves higher model robustness on CIFAR-10, CIFAR-100, and ImageNet.


Paper title:

Fast Adversarial Training with Adaptive Step Size

Paper link:

https://arxiv.org/abs/2206.02417


Background knowledge

FreeAT was the first fast adversarial training method: it reuses each mini-batch several times, updating the model parameters and the adversarial perturbation simultaneously. YOPO adopts a similar strategy to optimize the adversarial loss. Later, one-step methods were shown to be more effective than FreeAT and YOPO. With carefully tuned hyperparameters, FGSM with random start (FGSM-RS) can generate adversarial perturbations in a single step and train a robust model. ATTA exploits the transferability of adversarial examples across epochs and initializes the attack in the current epoch from the adversarial example generated in the previous epoch; the optimization takes the following form:

[Equation: ATTA update rule]

where the symbol in the equation denotes the adversarial example generated for a given sample in a given epoch. ATTA achieves robust accuracy comparable to FGSM-RS. SLAT perturbs the input and the latent features simultaneously with FGSM to obtain more reliable performance. All of these one-step methods, however, can suffer from catastrophic overfitting: the robust accuracy against PGD attacks suddenly drops to nearly 0, while the robust accuracy against FGSM attacks rises rapidly.
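As a concrete reference point, the sketch below shows what a single-step attack of this family looks like in PyTorch: one signed-gradient step from a given initialization, followed by projection back into the L-infinity ball. It is a minimal illustrative sketch, not the authors' code; the function name, the cross-entropy loss, and pixel values in [0, 1] are our assumptions.

```python
import torch
import torch.nn.functional as F

def fgsm_step(model, x, y, delta, alpha, eps):
    """One FGSM step from a given initialization: random noise for FGSM-RS,
    or the previous epoch's perturbation for ATTA-style training."""
    delta = delta.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x + delta), y)
    grad = torch.autograd.grad(loss, delta)[0]
    # ascend the loss along the sign of the input gradient
    delta = delta + alpha * grad.sign()
    # project back into the L-infinity ball and the valid pixel range
    delta = delta.clamp(-eps, eps)
    delta = (x + delta).clamp(0, 1) - x
    return delta.detach()
```

For FGSM-RS the initialization would be `torch.empty_like(x).uniform_(-eps, eps)`; for ATTA-style training it would be the perturbation stored for the same sample in the previous epoch.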

To prevent catastrophic overfitting, FGSM-GA adds a regularizer that aligns the directions of the input gradients. Another work studies the phenomenon from the perspective of the loss function and finds that catastrophic overfitting results from highly distorted loss surfaces; it proposes a new algorithm that detects overfitting by checking the loss values along the gradient direction. However, both algorithms require considerably more computation than FGSM-RS and ATTA.


The proposed algorithm

Previous studies show that the step size of the inner maximization in the adversarial training objective plays an important role in the performance of single-step attacks. Too large a step size attracts all of the FGSM perturbations toward the decision boundary, which leads to catastrophic overfitting and drives the robust accuracy against multi-step PGD attacks down to zero. However, simply shrinking the step size is not an option either: as the first two subplots of the figure below show, increasing the step size strengthens the attack and improves the robustness of the model.

[Figure: attack strength and robust accuracy under different step sizes (three subplots)]

To keep the attack as strong as possible while avoiding catastrophic overfitting, the authors assign a small step size to samples with a large gradient norm, which still yields a strong attack but prevents overfitting, and a large step size to samples with a small gradient norm. To this end, they maintain a moving average of each sample's gradient norm:

[Equation: per-sample moving average of the gradient norm]

which is used to set the step size for each sample in each epoch. The moving average is initialized at the first epoch, and a momentum factor balances the previous estimate against the new gradient norm. The step size is chosen to be inversely proportional to this moving average:

[Equation: adaptive step size, inversely proportional to the gradient-norm moving average]

where a predefined learning rate scales the step size and a small constant prevents it from becoming too large. The authors combine this adaptive step size with FGSM-RS, which randomly initializes the perturbation in the inner maximization. The third subplot of the figure above shows that the adaptive step size does not lead to catastrophic overfitting. Moreover, the average adaptive step size is even larger than the fixed step size of FGSM-RS, so the attack is stronger and the resulting model is more robust.
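To make the idea concrete, here is a minimal PyTorch-style sketch of such a per-sample adaptive step size, assuming an exponential moving average of the input-gradient norm; the class name, the default hyperparameters, and the exact form of the average are illustrative assumptions, not the paper's reference implementation:

```python
import torch

class AdaptiveStepSize:
    """Per-sample step size that is inversely proportional to a moving
    average of the input-gradient norm (illustrative sketch)."""
    def __init__(self, n_samples, lr=0.1, beta=0.5, c=0.01):
        self.v = torch.zeros(n_samples)          # per-sample moving average
        self.lr, self.beta, self.c = lr, beta, c

    def update(self, idx, grad):
        # grad: input gradient for the mini-batch, shape (B, ...)
        g_norm = grad.detach().flatten(1).norm(dim=1).cpu()
        self.v[idx] = self.beta * self.v[idx] + (1 - self.beta) * g_norm
        # samples with a larger accumulated gradient norm get a smaller step
        return self.lr / (self.v[idx] + self.c)
```

A sample that repeatedly produces large input gradients therefore sees its step size shrink over the epochs, while easier samples keep a large step size.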

Random initialization restricts how much perturbation can accumulate for samples that receive a small step size, which weakens the attack. By instead combining the adaptive step size with the initialization from the previous epoch, the proposed method ATAS does not need a large step size to cover the whole norm ball. For each sample, the adaptive step size is used in the following inner maximization to obtain the adversarial example:

[Equation: ATAS inner maximization step]

where the result is the adversarial example of the current epoch, and the per-sample quantity that determines the step size is updated as follows:

[Equation: per-sample update of the gradient-norm moving average]

Compared with previous methods that require substantial extra computation to avoid catastrophic overfitting, the overhead of ATAS is negligible: its training time is almost the same as that of ATTA and FGSM-RS. The detailed ATAS algorithm is as follows:

[Algorithm: ATAS]
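Putting the pieces together, the sketch below shows what one ATAS training epoch might look like in PyTorch. It follows the same single-step update as the `fgsm_step` sketch above and reuses the illustrative `AdaptiveStepSize` class; the assumption that the data loader also yields sample indices, and all variable names, are ours rather than the authors':

```python
import torch
import torch.nn.functional as F

def atas_epoch(model, loader, optimizer, deltas, stepper, eps):
    """deltas: per-sample perturbations carried over from the previous epoch,
    assumed to live on the same device as the data.
    stepper: AdaptiveStepSize instance holding the per-sample moving averages."""
    for x, y, idx in loader:                     # loader yields sample indices as well
        delta = deltas[idx].clone().requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]

        # per-sample adaptive step size, reshaped to broadcast over pixels
        alpha = stepper.update(idx, grad).to(x.device).view(-1, 1, 1, 1)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = ((x + delta).clamp(0, 1) - x).detach()
        deltas[idx] = delta                      # initialization for the next epoch

        optimizer.zero_grad()
        F.cross_entropy(model(x + delta), y).backward()
        optimizer.step()
```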

On the ImageNet dataset, the detailed ATAS algorithm is as follows:

[Algorithm: ATAS for ImageNet]

The authors analyze the convergence of the ATAS method under the corresponding norm constraint, starting from the following objective function:

[Equation: training objective]

The min-max problem can be formulated as follows:

[Equation: min-max formulation]
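Since the equation above is only available as an image in the source, here is the standard adversarial-training min-max objective for reference; the notation is a generic reconstruction and may differ from the paper's:

$$\min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n}\ \max_{\|\delta_i\|\le \epsilon}\ \ell\big(f_\theta(x_i+\delta_i),\, y_i\big)$$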

where the inner maximizer is the optimal adversarial example under the current parameters. The authors study this minimax problem under convex-concave and smoothness conditions, assuming the loss function satisfies the following assumptions.

Assumption 1. The training loss satisfies the following constraints:

1) The loss is convex and smooth in the model parameters, and its gradient with respect to the parameters satisfies the bound:

[Equation: bound on the parameter gradient]

where:

[Equation: constants appearing in the bound]

2) The loss is concave in the adversarial input and smooth for each sample within a norm ball of given radius around that sample; for any admissible pair of inputs, the input gradient satisfies:

[Equation: bound on the input gradient]

The convergence is measured at the average of the parameter trajectory over the training steps, which approximates the optimum:

[Equation: averaged iterates]

This is the standard technique for analyzing stochastic gradient methods. The convergence gap is defined as:

[Equation: convergence gap]
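For orientation, the usual duality-gap definition for a convex-concave saddle problem has the following form; this is a generic reconstruction in our own notation, since the paper's expression is only available as an image:

$$\mathrm{Gap}(\bar\theta,\bar\delta) \;=\; \max_{\delta'} f(\bar\theta,\delta') \;-\; \min_{\theta'} f(\theta',\bar\delta)$$

A point with zero gap is a saddle point, so bounding the gap of the averaged iterates certifies convergence.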

Its upper bound is given by the following expression:

[Equation: upper bound on the convergence gap]

Lemma 1. If the loss function satisfies Assumption 1, the objective function obeys the following convergence-gap inequality:

[Equation: Lemma 1]

Proof. The left-hand side of the formula in Lemma 1 can be bounded by the following chain of inequalities:

[Derivation: chain of inequalities]

The first and third inequalities follow from the optimality conditions, and the second uses Jensen's inequality. The proofs of Theorems 1 and 2 use the following gradient notation:

[Notation: gradient shorthand used in the proofs]

The ATAS method can also be written as an adaptive stochastic gradient descent block coordinate ascent method (ASGDBCA). At each step it samples one training example at random, applies a stochastic gradient descent step to the model parameters, and applies an adaptive block-coordinate ascent step to the corresponding input. Unlike SGDA, which updates all coordinates of the perturbation at every iteration, ASGDBCA updates only the block belonging to the sampled example. ASGDBCA first computes the preconditioning parameter:

[Equation: preconditioning parameter]

and then updates the model parameters and the perturbation as:

[Equation: ASGDBCA parameter and perturbation updates]

The main difference between ASGDBCA and ATAS lies in the preconditioner. To prove the convergence of ASGDBCA, the preconditioning parameter must be non-decreasing; otherwise ATAS, like Adam, may fail to converge. In practice, however, the non-convergent variant is more effective for neural networks, so ATAS still uses the moving average as its preconditioner.
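The distinction mirrors the Adam versus AMSGrad preconditioners; a minimal sketch of the two choices follows, with names and defaults that are our own illustrative assumptions:

```python
def precondition(v_prev: float, g_norm: float, beta: float = 0.5,
                 non_decreasing: bool = False) -> float:
    """Update a per-sample preconditioner from the latest gradient norm.

    non_decreasing=True keeps the preconditioner monotone, which is the
    property the convergence proof needs (AMSGrad-style); False is the
    plain moving average that ATAS uses in practice (Adam-style)."""
    v_new = beta * v_prev + (1 - beta) * g_norm
    return max(v_prev, v_new) if non_decreasing else v_new
```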

Theorem 1. Under Assumption 1, with suitably chosen learning rates, the convergence gap of ASGDBCA is bounded as follows:

[Equation: Theorem 1 bound for ASGDBCA]

Proof. At each step, ASGDBCA randomly samples an index from the training set and updates the corresponding example, so that:

[Derivation step]

Let:

[Definition: preconditioner over all coordinates of the sampled example]

denote the preconditioning parameter applied to all coordinates of the sampled example. Taking the expectation of the right-hand side above gives:

[Derivation step]

and, in addition:

[Derivation step]

Following a derivation similar to the SGDBCA case, we obtain:

[Derivation step]

Summing the inequality over the training steps, the upper bound can be written as:

[Derivation step]

and:

[Derivation step]

As in the SGD analysis, the arithmetic-geometric mean inequality shows that the bound is tightest at the optimal choice of learning rate, so that:

[Derivation step]

The first term can be bounded as:

[Derivation step]

where the notation refers to the coordinates of the corresponding example; combining this with Assumption 1 gives:

[Derivation step]

The second term can be bounded as:

[Derivation step]

where the notation picks out a single coordinate. Summing over these terms, the second term is bounded above by:

[Derivation step]

The third term is bounded above by:

[Derivation step]

so that:

[Derivation step]

Combining the above inequalities, the upper bound for ASGDBCA is:

[Derivation step]

By the arithmetic-geometric mean inequality, the bound is minimized at the optimal choice of learning rate:

[Derivation step]

Combining the pieces shows that the upper bound of ASGDBCA is:

[Equation: final bound for ASGDBCA]

The non-adaptive counterpart of ATAS, corresponding to ATTA, is stochastic gradient descent block coordinate ascent (SGDBCA), formulated as follows:

[Equation: SGDBCA updates]

Theorem 2. Under Assumption 1, with constant learning rates, the convergence gap of SGDBCA is bounded as follows:

[Equation: Theorem 2 bound for SGDBCA]

Proof. At each step, SGDBCA randomly samples an index from the index set and updates the corresponding adversarial perturbation, which yields the following inequality:

[Derivation step]

Therefore:

[Derivation step]

Rearranging the above inequality gives:

[Derivation step]

Similarly, one obtains:

[Derivation step]

Taking expectations on the left-hand sides of the two expressions above gives:

[Derivation step]

and therefore:

[Derivation step]

and:

[Derivation step]

Using the convexity in the parameters and the concavity in the inputs:

[Derivation step]

we obtain:

[Derivation step]

Combining the above inequalities gives:

[Derivation step]

Applying the update rule, we get:

[Derivation step]

The above inequality can be rearranged as:

[Derivation step]

Dividing both sides of the inequality by the corresponding factor gives:

[Derivation step]

Summing the terms yields the following upper bound:

[Derivation step]

which simplifies to:

[Derivation step]

By the arithmetic-geometric mean inequality, the optimal upper bound is attained at the optimal choice of the learning rates:

[Equation: final bound for SGDBCA]

Theorems 1 and 2 indicate that ASGDBCA converges faster than SGDBCA. When the number of steps is large, the third term of the gap bound is negligible for both SGDBCA and ASGDBCA. Since their first terms are identical, the main difference lies in the second term of the two bounds. Their ratio is as follows:

[Equation: ratio of the second terms of the two bounds]

The Cauchy-Schwarz inequality shows that this ratio is always at least 1. When the per-sample gradient norms follow a long-tailed distribution, the gap between ASGDBCA and SGDBCA becomes even larger, which indicates that ATAS converges correspondingly faster.
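As a rough numerical illustration of this mechanism (not the paper's exact quantities): for per-sample gradient norms $G_1,\dots,G_n$, the Cauchy-Schwarz inequality gives $\sum_i G_i \le \sqrt{n\sum_i G_i^2}$, and the slack grows as the norms spread out. The sketch below assumes the relevant ratio behaves like $\sqrt{n\sum_i G_i^2}\,/\,\sum_i G_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

def ratio(norms: np.ndarray) -> float:
    # Cauchy-Schwarz: sum(G) <= sqrt(n * sum(G^2)), so this ratio is >= 1
    return float(np.sqrt(len(norms) * np.sum(norms ** 2)) / np.sum(norms))

concentrated = rng.uniform(0.5, 1.5, size=n)   # well-concentrated gradient norms
long_tailed = rng.pareto(1.5, size=n) + 0.1    # heavy-tailed gradient norms

print(ratio(concentrated))  # close to 1
print(ratio(long_tailed))   # noticeably larger than 1
```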


Experimental results

The following three tables compare the accuracy and training time of different methods on CIFAR-10, CIFAR-100, and ImageNet. Note that, due to the computational cost, the authors did not have enough resources to run standard adversarial training and SSAT on ImageNet. The ImageNet models are trained on two GPUs, while the training time for CIFAR-10 and CIFAR-100 is measured on a single GPU. The results show that ATAS improves the robustness of the classifier under a variety of attacks (including PGD-10, PGD-50, and AutoAttack), and that catastrophic overfitting is avoided during training.

[Tables: robust accuracy and training time of different methods on CIFAR-10, CIFAR-100, and ImageNet]

As shown in the figure below, when the step size used in ATTA training increases, the gap between the ATTA loss and the PGD-10 loss shrinks. In addition, as long as the step size is not too large, the robust accuracy of the classifier increases with the step size. A preliminary conclusion is that larger step sizes also strengthen ATTA's attack. However, an overly large step size causes ATTA to suffer catastrophic overfitting.

[Figure: effect of the ATTA step size on the loss gap and robust accuracy]

The adaptive step size in ATAS allows a larger step size without causing catastrophic overfitting. The figure below compares ATTA and ATAS: even when the step size of ATAS is larger than that of ATTA, the model does not overfit the way ATTA does.

[Figure: comparison of ATTA and ATAS]


