Understanding the dynamic computation graph, requires_grad, and zero_grad
2022-07-24 18:33:00 【blanklog】
The dynamic computation graph is built during the forward pass of the program and is used mainly for back propagation. Whereas a network structure is described layer by layer in terms of operations, the computation graph is viewed primarily from the perspective of its data nodes (Tensors).
Several concepts around graph construction and back propagation are easy to confuse, for example is_leaf, requires_grad, detach(), zero_grad(), retain_grad(), and torch.no_grad(). Viewed from the perspective of back propagation through the computation graph, they all become clear.
Back propagation in dynamic graphs

Figure 1: Dynamic computation graph
The figure above is a schematic of a computation graph: X1 and X2 are two input-data Tensors, P1 and P2 are the network's weight Tensors, Y and Z are intermediate-result Tensors, and Fn denotes the operation that produces each intermediate result.
The ultimate goal of training the network is to update the values of P1 and P2, so we need the gradient of the loss with respect to P1 and P2. To obtain those gradients, the gradients of the loss with respect to the intermediate results Y and Z must be computed along the way.
Computing the gradients of the weight Tensors does not require gradients with respect to the input data, and the values of the inputs X never need to be updated, so no gradients are computed for X1 and X2.
Therefore, by default, user-created input data has requires_grad=False, while the network's weight parameters have requires_grad=True.
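To make these defaults concrete, here is a minimal check (assuming a standard PyTorch install; the nn.Linear layer is only a stand-in for "network weights"):

import torch
import torch.nn as nn

x = torch.tensor([4.0, 5.0, 6.0])      # user-created input data
layer = nn.Linear(3, 1)                 # network weights are nn.Parameter objects

print(x.requires_grad)                  # False: inputs do not track gradients by default
print(layer.weight.requires_grad)       # True: parameters are created with requires_grad=True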
1. Leaf node
A Tensor created directly by the user is a leaf node; such nodes record no grad_fn (for example, the input data and the network weights).
A Tensor derived through operations from leaf nodes that require gradients is a non-leaf node; such nodes do have a grad_fn.
Leaf nodes sit at the edge of the computation graph; they are where back propagation ends.
For example, in the code below, X, XX, P1, and P2 (the green nodes in Figure 2) are leaf nodes, while Y and loss (the white nodes) are non-leaf nodes.
import torch

p1 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
p2 = torch.tensor([1.1, 2.2, 3.3], requires_grad=True)
x = torch.tensor([4.0, 5.0, 6.0], requires_grad=False)
xx = x ** 2
y1 = p1 + xx
y2 = torch.sigmoid(y1)
y3 = y2 + p2
y4 = torch.sigmoid(y3)
loss = y4.mean()

Figure 2: Leaf nodes
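Continuing from the snippet above, the leaf/non-leaf split can be checked directly; the values in the comments reflect PyTorch's documented behaviour:

# x, xx, p1, p2 are leaves (grad_fn is None); y2 and loss are not.
for name, t in [("x", x), ("xx", xx), ("p1", p1), ("p2", p2), ("y2", y2), ("loss", loss)]:
    print(name, t.is_leaf, t.grad_fn)
# x / xx / p1 / p2 print is_leaf=True, grad_fn=None
# y2 / loss print is_leaf=False with a grad_fn such as <SigmoidBackward0> / <MeanBackward0>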
2. requires_grad
This attribute indicates that the gradient of the loss with respect to the current Tensor needs to be computed. Because the leaf nodes P1 and P2 need gradients to update their parameters, every intermediate node on the path from them to the loss also needs its gradient computed (requires_grad == True).
For example, the orange nodes in the figure below.

Figure 3: Nodes whose gradients need to be computed
PyTorch does not allow setting requires_grad to False on a non-leaf node; after all, having some Y in the middle cut off the gradient computation would be strange and inelegant. If you want to stop computing gradients for P1, set requires_grad of P1 itself to False; the requires_grad of Y1, Y2, and so on then becomes False as well.
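A small sketch of this rule (the exact wording of the RuntimeError may differ between PyTorch versions):

import torch

p1 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
x = torch.tensor([4.0, 5.0, 6.0])
y1 = p1 + x ** 2                 # non-leaf, inherits requires_grad=True from p1

# y1.requires_grad = False      # rejected: requires_grad can only be changed on leaf tensors

p1.requires_grad = False         # allowed on the leaf itself
y1 = p1 + x ** 2
print(y1.requires_grad)          # False: nothing upstream requires a gradient any more
print(y1.grad_fn)                # None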
3. detach()
What if P1 and P2 should each be updated with a different loss, as in a generative adversarial network? In that case, apply detach() to Y2, as shown below:

Figure 4: detach() diagram
detach() returns a new tensor that shares the same data but is cut off from the original computation graph; it becomes a leaf node, and a new graph can be built on top of it. In this way the two losses compute the gradients of P1 and P2 separately, without interfering with each other.
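A minimal sketch of the idea, reusing the tensor names from earlier rather than a real GAN (the shapes and operations are illustrative assumptions):

import torch

p1 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
p2 = torch.tensor([1.1, 2.2, 3.3], requires_grad=True)
x = torch.tensor([4.0, 5.0, 6.0])

y1 = torch.sigmoid(p1 + x)       # first graph, depends on p1
y2 = y1.detach()                 # new leaf, cut off from the first graph
y3 = torch.sigmoid(y2 + p2)      # second graph, depends only on p2

loss1 = y1.mean()
loss2 = y3.mean()

loss1.backward()                 # fills p1.grad only
loss2.backward()                 # fills p2.grad only; nothing flows back past y2
print(p1.grad, p2.grad)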
4. zero_grad()
If we feed data through the network twice, the computation graph gains a second branch, as shown in Figure 5.

Figure 5: Two forward passes
If we want to update the network with the gradients of both losses, we call backward() on loss1 and loss2 separately, and the two gradients of P1 and P2 are accumulated. This is equivalent to increasing the batch size (loss1 + loss2), or to updating a Siamese network structure.
loss1.backward()
loss2.backward()
optimizer.step()
# Equivalent to
(loss1 + loss2).backward()
optimizer.step()

If instead we want each loss to update the weight parameters on its own, the weight gradients must be zeroed after each update; otherwise they accumulate into the gradient of the next loss and distort the weight update.
loss1.backward()        # compute the gradient of loss1
optimizer.step()        # update the weights using the gradient of loss1
optimizer.zero_grad()   # zero the gradients of the network's weight parameters
loss2.backward()        # compute the gradient of loss2
optimizer.step()        # update the weights using the gradient of loss2
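A tiny self-contained illustration of why the zeroing matters (the gradient of (p ** 2).sum() is 2 * p; whether zero_grad() leaves zeros or None depends on the PyTorch version):

import torch

p = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([p], lr=0.1)

(p ** 2).sum().backward()
print(p.grad)                    # tensor([2., 4.])

# Without zero_grad(), the next backward() adds into the same .grad buffer:
(p ** 2).sum().backward()
print(p.grad)                    # tensor([4., 8.])  -- accumulated

optimizer.zero_grad()            # reset before the next, independent update
print(p.grad)                    # zeros, or None on versions where set_to_none defaults to True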
5. retain_grad()
The ultimate purpose of computing gradients is to let the optimizer update the weight parameters, so only the gradients of P1 and P2 are actually used. The gradients of the other, non-leaf nodes only serve to propagate backwards; once used they are discarded and do not need to be kept.

Figure 6: Nodes whose gradients are computed
To keep a non-leaf node's gradient after back propagation, call retain_grad() on it so the gradient is not released.
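A short standalone sketch (not the exact graph from the figure) showing that a non-leaf gradient survives only if retain_grad() was called:

import torch

p1 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = torch.sigmoid(p1)
y.retain_grad()                  # ask autograd to keep this non-leaf's gradient
loss = y.mean()
loss.backward()

print(p1.grad)                   # leaf gradient is always kept
print(y.grad)                    # would be None (with a warning) without retain_grad()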
6. torch.no_grad()
A context manager: Tensors produced inside the context are excluded from autograd, which is equivalent to stripping them out of the computation graph. It is used during evaluation to avoid the extra computation and memory that back propagation would otherwise require.
Setting requires_grad to False on every weight parameter of the network achieves the same effect, but torch.no_grad() is more concise and elegant.
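A minimal example of the context manager; the model and x here are placeholders, not part of the figures above:

import torch
import torch.nn as nn

model = nn.Linear(3, 1)
x = torch.tensor([[4.0, 5.0, 6.0]])

with torch.no_grad():            # no graph is recorded inside this block
    out = model(x)

print(out.requires_grad)         # False: the output is cut off from autograd
print(out.grad_fn)               # None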
7. Optimizer
The optimizer updates the weights according to the gradients of the weight parameters. When constructing the optimizer, the weight parameters that need updating are registered with it; the optimizer can then zero the gradients of those parameters and apply the weight updates.
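A sketch of the usual pattern, with a hypothetical nn.Linear model and random data standing in for a real training step:

import torch
import torch.nn as nn

model = nn.Linear(3, 1)
# Register the parameters to be updated when the optimizer is built:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 3)
target = torch.randn(8, 1)

loss = nn.functional.mse_loss(model(x), target)
optimizer.zero_grad()            # clear old gradients of the registered parameters
loss.backward()                  # compute new gradients
optimizer.step()                 # update the registered parameters with those gradients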
The above are some observations and conclusions from using PyTorch; corrections are welcome if anything is wrong.