PyTorch Mixed Precision: How It Works and How to Enable It
2022-06-26 06:16:00 【Le0v1n】
1. Prerequisites
In PyTorch, the default precision is Float32, i.e., 32-bit floating point.
The purpose of using Automatic Mixed Precision (AMP) is to run the model with 16-bit tensors instead of 32-bit ones, because full 32-bit precision is not always necessary for the model to learn.
PyTorch 1.6 ships with a built-in mixed-precision package:
```python
from torch.cuda.amp import GradScaler, autocast
```
Automatic mixed precision means mixing torch.FloatTensor and torch.HalfTensor.
1.1 Thinking about data types
Consider a question: why not use pure torch.FloatTensor or pure torch.HalfTensor?
To answer it, we first need to know the characteristics of these two data types, or more precisely, their strengths and weaknesses.
torch.HalfTensor advantages: it takes less storage, computes faster, and makes better use of CUDA devices. It therefore reduces GPU memory usage during training, and because the arithmetic is cheaper, training is faster; according to NVIDIA, on some devices this precision can roughly double the training speed.

torch.HalfTensor disadvantages:

- Narrow numerical range (more prone to overflow/underflow), which sometimes turns the loss into nan so training cannot proceed
- Rounding error (Rounding Error): some tiny gradient values fall below the lowest resolution of 16-bit precision and are lost

torch.FloatTensor advantage: it does not have the shortcomings of torch.HalfTensor. Its drawbacks are that it occupies more GPU memory and trains more slowly.

So, whenever the scenario allows it, using torch.HalfTensor can speed up training; the quick check below illustrates the trade-off.
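The storage and precision differences can be checked directly (a minimal sketch, not from the original post; the exact printed values depend on your PyTorch build):

```python
import torch

# Numerical range and resolution of the two dtypes
print(torch.finfo(torch.float32))  # eps ~1.19e-07, max ~3.40e+38
print(torch.finfo(torch.float16))  # eps ~9.77e-04, max 65504 -> narrow range, coarse resolution

# Storage: half precision uses half the memory for the same tensor
x32 = torch.randn(1024, 1024, dtype=torch.float32)
x16 = x32.half()
print(x32.element_size() * x32.nelement())  # 4194304 bytes (4 MiB)
print(x16.element_size() * x16.nelement())  # 2097152 bytes (2 MiB)
```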
1.2 Two Schemes for Eliminating the Disadvantages of torch.HalfTensor
In order to eliminate the disadvantages of torch.HalfTensor, there are usually two schemes.
1.2.1 Scheme 1: torch.cuda.amp.GradScaler
Gradient scaling uses torch.cuda.amp.GradScaler: the loss is scaled up to prevent gradient underflow. The scaling is only applied during gradient back-propagation; when the optimizer updates the weights, the gradients must first be unscaled back (a minimal illustration follows below).

- Backward pass -> scale -> back-propagate the enlarged gradients
- Parameter update -> use the unscaled (original) gradients
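The following sketch (not part of the original training code) shows a gradient value that is too small for float16 surviving once it is scaled, and being recovered by unscaling in float32:

```python
import torch

# A tiny gradient value: fine in float32, but below the smallest float16 subnormal (~6e-8)
g = torch.tensor(1e-8, dtype=torch.float32)
print(g.half())                    # tensor(0., dtype=torch.float16) -> underflow, information lost

# Scaling by e.g. 2**16 (a typical initial scale) moves it into float16's representable range
scale = 2.0 ** 16
g_scaled = (g * scale).half()
print(g_scaled)                    # a non-zero float16 value -> the gradient survives

# Before the optimizer step, the gradient is unscaled back in float32, recovering ~1e-8
print(g_scaled.float() / scale)
```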
1.2.2 Scheme 2: the autocast() Context Manager / Decorator
- Context manager -> with autocast():
- Decorator -> @autocast()
Operations that are not on the list below remain torch.FloatTensor, which is where the word "mixed" comes from. How do you know when torch.FloatTensor is used and when half precision is used? The PyTorch framework decides: inside the PyTorch 1.6 AMP context (or under the decorator), tensors involved in the following operations are automatically converted to half precision (torch.HalfTensor):
| Operation | Explanation |
|---|---|
| __matmul__ | Matrix multiplication (the @ operator, ⊙) |
| addbmm | Batched matrix product (⊗) added to input |
| addmm | torch.addmm(input, mat1, mat2): matrix-multiplies mat1 and mat2 and adds the result to input |
| addmv | torch.addmv(input, mat, vec): mat ⊙ vec, then the result is added to input |
| addr | torch.addr(input, vec1, vec2): vec1 ⊗ vec2 + input |
| baddbmm | torch.baddbmm(input, batch1, batch2): batch1 ⊙ batch2 + input |
| bmm | torch.bmm(input, mat2): input ⊙ mat2 |
| chain_matmul | torch.chain_matmul(*matrices): returns the matrix product of N 2-D tensors; the product is computed efficiently with the matrix chain ordering algorithm, which picks the order with the lowest arithmetic cost |
| conv1d | 1-D convolution |
| conv2d | 2-D convolution |
| conv3d | 3-D convolution |
| conv_transpose1d | 1-D transposed convolution |
| conv_transpose2d | 2-D transposed convolution |
| conv_transpose3d | 3-D transposed convolution |
| linear | Linear layer |
| matmul | torch.matmul(input, other): matrix product (⊙) of two tensors |
| mm | torch.mm(input, mat2): input ⊙ mat2 |
| mv | torch.mv(input, vec): input ⊙ vec |
| prelu | PReLU, a learnable variant of the ReLU activation function |
- mm -> matrix × matrix
- mv -> matrix × vector
- If the ⊙ symbol is unclear, see the blog post "Explanation of 'circled plus', 'circled multiplication' and 'dot product' in computer vision papers"
- For transposed convolution, see the blog post "Transposed Convolution: Introduction and Theory"
- If PReLU is unfamiliar, see the blog post "Commonly Used Activation Functions in Deep Learning"
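As a quick sanity check of the table above (a minimal sketch, assuming a CUDA device is available), you can verify that an operation on the list really runs in half precision inside autocast:

```python
import torch
from torch.cuda.amp import autocast

a = torch.randn(4, 4, device="cuda")  # created as torch.float32
b = torch.randn(4, 4, device="cuda")

with autocast():
    c = torch.mm(a, b)  # mm is on the autocast list -> computed in float16

print(a.dtype, c.dtype)  # torch.float32 torch.float16
```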
2. How to Use AMP (Automatic Mixed Precision) in PyTorch
In short, you need to master two components:

- autocast
- GradScaler
2.1 autocast
As mentioned above, AMP relies on the autocast class from the torch.cuda.amp module.
The following is a standard training loop for a classification network (the prediction phase is not included):
```python
import sys

import torch
import torch.optim as optim
from tqdm import tqdm

# Create the model; its parameters default to torch.FloatTensor
model = Net().to(device)
optimizer = optim.SGD(model.parameters(), ...)

for epoch in range(1, args.epochs + 1):  # epoch -> [1, epochs]
    if rank == 0:  # wrap the dataloader with tqdm in the main process
        data_loader = tqdm(data_loader, file=sys.stdout)
    for step, (inputs, labels) in enumerate(data_loader):  # iterate over data_loader
        optimizer.zero_grad()  # first clear any leftover gradients in the optimizer

        pred = model(inputs.to(device))          # forward pass to get the predictions
        loss = loss_fn(pred, labels.to(device))  # loss between predictions and ground truth

        loss.backward()  # back-propagate after computing the loss

        if rank == 0:  # print training info in the main process
            # (mean_loss is assumed to be tracked elsewhere in the full script)
            data_loader.desc = (f"[train]epoch {epoch}/{args.epochs} | "
                                f"lr: {optimizer.param_groups[0]['lr']:.4f} | "
                                f"mloss: {round(mean_loss.item(), 4):.4f}")

        # Each GPU checks on its own whether the loss is finite
        if not torch.isfinite(loss):
            print(f"WARNING: non-finite loss, ending training! loss -> {loss}")
            sys.exit(1)  # if the loss is infinite -> stop training

        # Each GPU checks on its own whether the loss is NaN
        if torch.isnan(loss):
            print(f"WARNING: nan loss, ending training! loss -> {loss}")
            sys.exit(1)  # if the loss is NaN -> stop training

        optimizer.step()  # finally, the optimizer updates the parameters
```
If you want to use AMP, it is also very simple, as shown below:
```python
from torch.cuda.amp import autocast

# Create the model; its parameters default to torch.FloatTensor
model = Net().to(device)
optimizer = optim.SGD(model.parameters(), ...)

for epoch in range(1, args.epochs + 1):  # epoch -> [1, epochs]
    if rank == 0:  # wrap the dataloader with tqdm in the main process
        data_loader = tqdm(data_loader, file=sys.stdout)
    for step, (inputs, labels) in enumerate(data_loader):  # iterate over data_loader
        optimizer.zero_grad()  # first clear any leftover gradients in the optimizer

        # Only the forward pass and the loss computation need to run under autocast!
        with autocast():  # open the autocast context
            pred = model(inputs.to(device))          # forward pass to get the predictions
            loss = loss_fn(pred, labels.to(device))  # loss between predictions and ground truth

        loss.backward()  # back-propagate after computing the loss

        if rank == 0:  # print training info in the main process
            data_loader.desc = (f"[train]epoch {epoch}/{args.epochs} | "
                                f"lr: {optimizer.param_groups[0]['lr']:.4f} | "
                                f"mloss: {round(mean_loss.item(), 4):.4f}")

        # Each GPU checks on its own whether the loss is finite
        if not torch.isfinite(loss):
            print(f"WARNING: non-finite loss, ending training! loss -> {loss}")
            sys.exit(1)  # if the loss is infinite -> stop training

        # Each GPU checks on its own whether the loss is NaN
        if torch.isnan(loss):
            print(f"WARNING: nan loss, ending training! loss -> {loss}")
            sys.exit(1)  # if the loss is NaN -> stop training

        optimizer.step()  # finally, the optimizer updates the parameters
```
After entering the autocast context, the CUDA operations listed earlier (the table in Section 1.2.2) convert the dtype of the tensors they operate on to torch.HalfTensor, which speeds up computation without losing training accuracy.
When entering the autocast context, the tensors can be of any type; you do not need to manually call .half() on the model or on the inputs. The PyTorch framework converts them automatically, which is where the word "Automatic" in Automatic Mixed Precision comes from.
One more point: the autocast context should only contain the forward pass of the network (including the loss computation), not the back-propagation, because the backward operations automatically use the same dtype as the corresponding forward operations. Both points are illustrated by the sketch below.
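A small sketch (illustrative only, assuming a CUDA device): the model and inputs stay float32 with no manual .half() call, the forward output inside autocast is float16, and after a backward pass outside the context the parameter gradients are still float32.

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast

model = nn.Linear(8, 4).cuda()        # parameters are torch.float32; no .half() needed
x = torch.randn(2, 8, device="cuda")  # inputs are torch.float32; no .half() needed

with autocast():                      # only the forward pass (and the loss) goes here
    out = model(x)                    # linear is on the autocast list -> float16 output
    loss = out.sum()                  # a toy "loss" just for illustration

loss.backward()                       # backward stays outside the autocast context

print(out.dtype)                      # torch.float16
print(model.weight.dtype)             # torch.float32
print(model.weight.grad.dtype)        # torch.float32
```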
2.2 autocast Errors
Sometimes, code running inside the autocast context raises the following error:
```
Traceback (most recent call last):
  ......
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  ......
RuntimeError: expected scalar type float but found c10::Half
```
The fix is to add the autocast() decorator to the model's forward function. Take MobileNetV3-Small as an example:
```python
import torch.nn as nn
from torch.cuda.amp import autocast

class MobileNetV3(nn.Module):
    def __init__(self, num_classes=27, sample_size=112, dropout=0.2, width_mult=1.0):
        super(MobileNetV3, self).__init__()
        input_channel = 16
        last_channel = 1024
        # ... various layer definitions ...

    @autocast()
    def forward(self, x):
        # This is the top-level forward of MobileNetV3-Small;
        # adding the autocast() decorator to this function is enough.
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x
```
Note:

- Only the outermost forward function needs the decorator; the forward functions of the inner modules do not need autocast().
2.3 GradScaler
Do not forget the gradient scaler mentioned earlier: you need to instantiate a GradScaler object before training starts. The classic PyTorch AMP usage therefore looks like this:
```python
from torch.cuda.amp import autocast, GradScaler

# Instantiate a GradScaler object before training begins
scaler = GradScaler()

for epoch in range(1, args.epochs + 1):  # epoch -> [1, epochs]
    if rank == 0:  # wrap the dataloader with tqdm in the main process
        data_loader = tqdm(data_loader, file=sys.stdout)
    for step, (inputs, labels) in enumerate(data_loader):  # iterate over data_loader
        optimizer.zero_grad()  # first clear any leftover gradients in the optimizer (unchanged)

        # Only the forward pass and the loss computation need to run under autocast!
        with autocast():  # open the autocast context
            pred = model(inputs.to(device))          # forward pass to get the predictions
            loss = loss_fn(pred, labels.to(device))  # loss between predictions and ground truth

        # Use the scaler to scale the loss up first, then back-propagate the scaled gradients
        scaler.scale(loss).backward()

        if rank == 0:  # print training info in the main process
            data_loader.desc = (f"[train]epoch {epoch}/{args.epochs} | "
                                f"lr: {optimizer.param_groups[0]['lr']:.4f} | "
                                f"mloss: {round(mean_loss.item(), 4):.4f}")

        # Each GPU checks on its own whether the loss is finite
        if not torch.isfinite(loss):
            print(f"WARNING: non-finite loss, ending training! loss -> {loss}")
            sys.exit(1)  # if the loss is infinite -> stop training

        # Each GPU checks on its own whether the loss is NaN
        if torch.isnan(loss):
            print(f"WARNING: nan loss, ending training! loss -> {loss}")
            sys.exit(1)  # if the loss is NaN -> stop training

        # scaler.step() first unscales the gradients back to their original values.
        # If the gradients contain no infs or NaNs, it calls optimizer.step() to update the weights;
        # otherwise the step is skipped, so the weights are not corrupted by an obviously wrong gradient.
        scaler.step(optimizer)
        # scaler.update() adjusts the scale factor dynamically depending on whether this step overflowed.
        scaler.update()
```
The scale factor is estimated dynamically at every iteration. To reduce gradient underflow as much as possible, the scale should be large; but if it is too large, half-precision tensors easily overflow (becoming inf or nan). So the principle of the dynamic estimation is to keep the scale as large as possible without producing inf or nan: every call to scaler.step(optimizer) checks whether any gradient contains inf or NaN:
- If inf or nan appears, scaler.step(optimizer) skips the weight update (it does not call optimizer.step()) and reduces the scale (multiplies it by backoff_factor).
- If no inf or nan appears, the weights are updated normally (optimizer.step() is called); and once a number of consecutive iterations (specified by growth_interval) pass without inf or nan, scaler.update() enlarges the scale (multiplies it by growth_factor).

These knobs correspond to GradScaler's constructor arguments, as the sketch below shows.
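A minimal sketch with the default values documented for PyTorch 1.6+ (written out explicitly only for illustration):

```python
from torch.cuda.amp import GradScaler

# Defaults spelled out; these control the dynamic adjustment described above
scaler = GradScaler(
    init_scale=2.**16,     # initial scale factor
    growth_factor=2.0,     # multiply the scale by this after `growth_interval` stable steps
    backoff_factor=0.5,    # multiply the scale by this when inf/nan gradients are found
    growth_interval=2000,  # number of consecutive non-overflowing steps before growing
)

# During training you can inspect the current scale, e.g. for logging
print(scaler.get_scale())
```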
You can also adopt the DeepVAC project conventions for PyTorch to simplify development: https://github.com/deepVAC/deepvac/.
3. Things to Watch Out For
3.1 The Loss Becomes inf or nan
- You can choose not to use GradScaler at all: just use the autocast() context manager, then call loss.backward() -> optimizer.step().
- Occasional gradient overflow during loss scaling can be ignored, because AMP detects the overflow and skips that update (if you customize the return value of optimizer.step, scaler.step will return None whenever overflow is detected). The scaler automatically reduces the scale for the next iteration, and if updates stay stable for a while, it will try to enlarge the scale again. One way to observe skipped steps is shown in the sketch below.
- If overflow happens constantly and the loss is very unstable, reduce the learning rate (a 10x reduction is a reasonable start); if the loss still fluctuates, the problem probably lies deeper in the network.
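A common pattern (an illustrative sketch, not from the original post) for checking whether a step was skipped is to compare the scale before and after update(): if the scale shrank, the gradients contained inf/nan and the weight update was skipped.

```python
scale_before = scaler.get_scale()
scaler.step(optimizer)   # skipped internally if the unscaled gradients contain inf/nan
scaler.update()
if scaler.get_scale() < scale_before:  # the scale was backed off -> this step was skipped
    print(f"step {step}: optimizer.step() skipped due to inf/nan gradients")
```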
3.2 Training Becomes Slower with AMP
The possible reasons are:

- The overhead of converting between single and half precision. This part is relatively small, and the computation saved by half precision usually more than covers it.
- The scaling and unscaling of values during the backward pass: adding the scaler does slow things down, and this cost can be considerable, because many parameter gradients have to be scaled (extra multiplications and divisions). However, without the scaler, underflow easily occurs during the backward pass (16-bit precision can only represent a limited range, so too-small gradient values lose information), and the final result may be worse. Overall this is a balancing act: trading some time for numerical headroom. The micro-benchmark sketch below can help you check whether AMP pays off on your hardware.
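A rough micro-benchmark (an illustrative sketch, assuming a CUDA device; the matrix size and iteration count are arbitrary) shows whether half-precision matrix multiplication is actually faster on your GPU:

```python
import time

import torch
from torch.cuda.amp import autocast

def bench(use_amp: bool, n_iters: int = 100) -> float:
    a = torch.randn(2048, 2048, device="cuda")
    b = torch.randn(2048, 2048, device="cuda")
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        if use_amp:
            with autocast():
                _ = a @ b   # runs in float16 under autocast
        else:
            _ = a @ b       # plain float32 matmul
    torch.cuda.synchronize()  # wait for all kernels before stopping the clock
    return time.time() - start

print("fp32:", bench(False))
print("amp :", bench(True))
```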
3.3 Recommended Articles
- Speed tests of DP and DDP with amp and GradScaler, using AlexNet as the template
- PyTorch Automatic Mixed Precision (AMP) training
- PyTorch distributed training basics: using DDP
References
- https://zhuanlan.zhihu.com/p/165152789
- https://blog.csdn.net/weixin_44878336/article/details/124501040
- https://blog.csdn.net/weixin_44878336/article/details/124754484
- https://blog.csdn.net/weixin_44878336/article/details/125119242
- https://zhuanlan.zhihu.com/p/516996892