PyTorch Mixed Precision: How It Works and How to Enable It
2022-06-26 06:16:00 【Le0v1n】
1. Prerequisites
In PyTorch, the default precision is Float32, i.e., 32-bit floating point.
The purpose of using Automatic Mixed Precision (AMP) is to run the model with 16-bit tensors instead of 32-bit ones, because full 32-bit precision is not always necessary for the model to learn.
PyTorch 1.6 ships with a built-in mixed-precision package:
```python
from torch.cuda.amp import GradScaler, autocast
```
Automatic mixed precision means mixing torch.FloatTensor and torch.HalfTensor.
1.1 Thinking about data types
Consider a question: why not use pure torch.FloatTensor or pure torch.HalfTensor?
To answer it, we first need to know the characteristics of these two data types, or more precisely, their strengths and weaknesses.
torch.HalfTensor advantages: it takes less storage, computes faster, and makes better use of CUDA devices. It therefore reduces GPU memory usage during training, and because the arithmetic is cheaper, training is faster; according to NVIDIA, on some devices this precision can roughly double the training speed.

torch.HalfTensor disadvantages:

- Narrow numerical range (more prone to overflow/underflow), which sometimes turns the loss into nan so training cannot proceed
- Rounding error (Rounding Error): some tiny gradient values fall below the lowest resolution of 16-bit precision and are lost

torch.FloatTensor advantage: it does not have the shortcomings of torch.HalfTensor. Its drawbacks are that it occupies more GPU memory and trains more slowly.

So, whenever the scenario allows it, using torch.HalfTensor can speed up training; the quick check below illustrates the trade-off.
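The storage and precision differences can be checked directly (a minimal sketch, not from the original post; the exact printed values depend on your PyTorch build):

```python
import torch

# Numerical range and resolution of the two dtypes
print(torch.finfo(torch.float32))  # eps ~1.19e-07, max ~3.40e+38
print(torch.finfo(torch.float16))  # eps ~9.77e-04, max 65504 -> narrow range, coarse resolution

# Storage: half precision uses half the memory for the same tensor
x32 = torch.randn(1024, 1024, dtype=torch.float32)
x16 = x32.half()
print(x32.element_size() * x32.nelement())  # 4194304 bytes (4 MiB)
print(x16.element_size() * x16.nelement())  # 2097152 bytes (2 MiB)
```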
1.2 Two Schemes for Eliminating the Disadvantages of torch.HalfTensor
In order to eliminate the disadvantages of torch.HalfTensor, there are usually two schemes.
1.2.1 Scheme 1: torch.cuda.amp.GradScaler
Gradient scaling uses torch.cuda.amp.GradScaler: the loss is scaled up to prevent gradient underflow. The scaling is only applied during gradient back-propagation; when the optimizer updates the weights, the gradients must first be unscaled back (a minimal illustration follows below).

- Backward pass -> scale -> back-propagate the enlarged gradients
- Parameter update -> use the unscaled (original) gradients
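The following sketch (not part of the original training code) shows a gradient value that is too small for float16 surviving once it is scaled, and being recovered by unscaling in float32:

```python
import torch

# A tiny gradient value: fine in float32, but below the smallest float16 subnormal (~6e-8)
g = torch.tensor(1e-8, dtype=torch.float32)
print(g.half())                    # tensor(0., dtype=torch.float16) -> underflow, information lost

# Scaling by e.g. 2**16 (a typical initial scale) moves it into float16's representable range
scale = 2.0 ** 16
g_scaled = (g * scale).half()
print(g_scaled)                    # a non-zero float16 value -> the gradient survives

# Before the optimizer step, the gradient is unscaled back in float32, recovering ~1e-8
print(g_scaled.float() / scale)
```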
1.2.2 Scheme 2: the autocast() Context Manager / Decorator
- Context manager -> with autocast():
- Decorator -> @autocast()
Operations that are not on the list below remain torch.FloatTensor, which is where the word "mixed" comes from. How do you know when torch.FloatTensor is used and when half precision is used? The PyTorch framework decides: inside the PyTorch 1.6 AMP context (or under the decorator), tensors involved in the following operations are automatically converted to half precision (torch.HalfTensor):
| Operation | Explanation |
|---|---|
| __matmul__ | Matrix multiplication (the @ operator, ⊙) |
| addbmm | Batched matrix product (⊗) added to input |
| addmm | torch.addmm(input, mat1, mat2): matrix-multiplies mat1 and mat2 and adds the result to input |
| addmv | torch.addmv(input, mat, vec): mat ⊙ vec, then the result is added to input |
| addr | torch.addr(input, vec1, vec2): vec1 ⊗ vec2 + input |
| baddbmm | torch.baddbmm(input, batch1, batch2): batch1 ⊙ batch2 + input |
| bmm | torch.bmm(input, mat2): input ⊙ mat2 |
| chain_matmul | torch.chain_matmul(*matrices): returns the matrix product of N 2-D tensors; the product is computed efficiently with the matrix chain ordering algorithm, which picks the order with the lowest arithmetic cost |
| conv1d | 1-D convolution |
| conv2d | 2-D convolution |
| conv3d | 3-D convolution |
| conv_transpose1d | 1-D transposed convolution |
| conv_transpose2d | 2-D transposed convolution |
| conv_transpose3d | 3-D transposed convolution |
| linear | Linear layer |
| matmul | torch.matmul(input, other): matrix product (⊙) of two tensors |
| mm | torch.mm(input, mat2): input ⊙ mat2 |
| mv | torch.mv(input, vec): input ⊙ vec |
| prelu | PReLU, a learnable variant of the ReLU activation function |
- mm -> matrix × matrix
- mv -> matrix × vector
- If the ⊙ symbol is unclear, see the blog post "Explanation of 'circled plus', 'circled multiplication' and 'dot product' in computer vision papers"
- For transposed convolution, see the blog post "Transposed Convolution: Introduction and Theory"
- If PReLU is unfamiliar, see the blog post "Commonly Used Activation Functions in Deep Learning"
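As a quick sanity check of the table above (a minimal sketch, assuming a CUDA device is available), you can verify that an operation on the list really runs in half precision inside autocast:

```python
import torch
from torch.cuda.amp import autocast

a = torch.randn(4, 4, device="cuda")  # created as torch.float32
b = torch.randn(4, 4, device="cuda")

with autocast():
    c = torch.mm(a, b)  # mm is on the autocast list -> computed in float16

print(a.dtype, c.dtype)  # torch.float32 torch.float16
```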
2. How to Use AMP (Automatic Mixed Precision) in PyTorch
In short, you need to master two components:

- autocast
- GradScaler
2.1 autocast
As mentioned above, AMP relies on the autocast class from the torch.cuda.amp module.
The following is a standard training loop for a classification network (the prediction phase is not included):
```python
import sys

import torch
import torch.optim as optim
from tqdm import tqdm

# Create the model; its parameters default to torch.FloatTensor
model = Net().to(device)
optimizer = optim.SGD(model.parameters(), ...)

for epoch in range(1, args.epochs + 1):  # epoch -> [1, epochs]
    if rank == 0:  # wrap the dataloader with tqdm in the main process
        data_loader = tqdm(data_loader, file=sys.stdout)
    for step, (inputs, labels) in enumerate(data_loader):  # iterate over data_loader
        optimizer.zero_grad()  # first clear any leftover gradients in the optimizer

        pred = model(inputs.to(device))          # forward pass to get the predictions
        loss = loss_fn(pred, labels.to(device))  # loss between predictions and ground truth

        loss.backward()  # back-propagate after computing the loss

        if rank == 0:  # print training info in the main process
            # (mean_loss is assumed to be tracked elsewhere in the full script)
            data_loader.desc = (f"[train]epoch {epoch}/{args.epochs} | "
                                f"lr: {optimizer.param_groups[0]['lr']:.4f} | "
                                f"mloss: {round(mean_loss.item(), 4):.4f}")

        # Each GPU checks on its own whether the loss is finite
        if not torch.isfinite(loss):
            print(f"WARNING: non-finite loss, ending training! loss -> {loss}")
            sys.exit(1)  # if the loss is infinite -> stop training

        # Each GPU checks on its own whether the loss is NaN
        if torch.isnan(loss):
            print(f"WARNING: nan loss, ending training! loss -> {loss}")
            sys.exit(1)  # if the loss is NaN -> stop training

        optimizer.step()  # finally, the optimizer updates the parameters
```
If you want to use AMP, it is also very simple, as shown below:
```python
from torch.cuda.amp import autocast

# Create the model; its parameters default to torch.FloatTensor
model = Net().to(device)
optimizer = optim.SGD(model.parameters(), ...)

for epoch in range(1, args.epochs + 1):  # epoch -> [1, epochs]
    if rank == 0:  # wrap the dataloader with tqdm in the main process
        data_loader = tqdm(data_loader, file=sys.stdout)
    for step, (inputs, labels) in enumerate(data_loader):  # iterate over data_loader
        optimizer.zero_grad()  # first clear any leftover gradients in the optimizer

        # Only the forward pass and the loss computation need to run under autocast!
        with autocast():  # open the autocast context
            pred = model(inputs.to(device))          # forward pass to get the predictions
            loss = loss_fn(pred, labels.to(device))  # loss between predictions and ground truth

        loss.backward()  # back-propagate after computing the loss

        if rank == 0:  # print training info in the main process
            data_loader.desc = (f"[train]epoch {epoch}/{args.epochs} | "
                                f"lr: {optimizer.param_groups[0]['lr']:.4f} | "
                                f"mloss: {round(mean_loss.item(), 4):.4f}")

        # Each GPU checks on its own whether the loss is finite
        if not torch.isfinite(loss):
            print(f"WARNING: non-finite loss, ending training! loss -> {loss}")
            sys.exit(1)  # if the loss is infinite -> stop training

        # Each GPU checks on its own whether the loss is NaN
        if torch.isnan(loss):
            print(f"WARNING: nan loss, ending training! loss -> {loss}")
            sys.exit(1)  # if the loss is NaN -> stop training

        optimizer.step()  # finally, the optimizer updates the parameters
```
After entering the autocast context, the CUDA operations listed earlier (the table in Section 1.2.2) convert the dtype of the tensors they operate on to torch.HalfTensor, which speeds up computation without losing training accuracy.
When entering the autocast context, the tensors can be of any type; you do not need to manually call .half() on the model or on the inputs. The PyTorch framework converts them automatically, which is where the word "Automatic" in Automatic Mixed Precision comes from.
One more point: the autocast context should only contain the forward pass of the network (including the loss computation), not the back-propagation, because the backward operations automatically use the same dtype as the corresponding forward operations. Both points are illustrated by the sketch below.
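A small sketch (illustrative only, assuming a CUDA device): the model and inputs stay float32 with no manual .half() call, the forward output inside autocast is float16, and after a backward pass outside the context the parameter gradients are still float32.

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast

model = nn.Linear(8, 4).cuda()        # parameters are torch.float32; no .half() needed
x = torch.randn(2, 8, device="cuda")  # inputs are torch.float32; no .half() needed

with autocast():                      # only the forward pass (and the loss) goes here
    out = model(x)                    # linear is on the autocast list -> float16 output
    loss = out.sum()                  # a toy "loss" just for illustration

loss.backward()                       # backward stays outside the autocast context

print(out.dtype)                      # torch.float16
print(model.weight.dtype)             # torch.float32
print(model.weight.grad.dtype)        # torch.float32
```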
2.2 autocast Errors
Sometimes, code running inside the autocast context raises the following error:
```
Traceback (most recent call last):
  ......
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  ......
RuntimeError: expected scalar type float but found c10::Half
```
The fix is to add the autocast() decorator to the model's forward function. Take MobileNetV3-Small as an example:
```python
import torch.nn as nn
from torch.cuda.amp import autocast

class MobileNetV3(nn.Module):
    def __init__(self, num_classes=27, sample_size=112, dropout=0.2, width_mult=1.0):
        super(MobileNetV3, self).__init__()
        input_channel = 16
        last_channel = 1024
        # ... various layer definitions ...

    @autocast()
    def forward(self, x):
        # This is the top-level forward of MobileNetV3-Small;
        # adding the autocast() decorator to this function is enough.
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x
```
Note:

- Only the outermost forward function needs the decorator; the forward functions of the inner modules do not need autocast().
2.3 GradScaler
Do not forget the gradient scaler mentioned earlier: you need to instantiate a GradScaler object before training starts. The classic PyTorch AMP usage therefore looks like this:
```python
from torch.cuda.amp import autocast, GradScaler

# Instantiate a GradScaler object before training begins
scaler = GradScaler()

for epoch in range(1, args.epochs + 1):  # epoch -> [1, epochs]
    if rank == 0:  # wrap the dataloader with tqdm in the main process
        data_loader = tqdm(data_loader, file=sys.stdout)
    for step, (inputs, labels) in enumerate(data_loader):  # iterate over data_loader
        optimizer.zero_grad()  # first clear any leftover gradients in the optimizer (unchanged)

        # Only the forward pass and the loss computation need to run under autocast!
        with autocast():  # open the autocast context
            pred = model(inputs.to(device))          # forward pass to get the predictions
            loss = loss_fn(pred, labels.to(device))  # loss between predictions and ground truth

        # Use the scaler to scale the loss up first, then back-propagate the scaled gradients
        scaler.scale(loss).backward()

        if rank == 0:  # print training info in the main process
            data_loader.desc = (f"[train]epoch {epoch}/{args.epochs} | "
                                f"lr: {optimizer.param_groups[0]['lr']:.4f} | "
                                f"mloss: {round(mean_loss.item(), 4):.4f}")

        # Each GPU checks on its own whether the loss is finite
        if not torch.isfinite(loss):
            print(f"WARNING: non-finite loss, ending training! loss -> {loss}")
            sys.exit(1)  # if the loss is infinite -> stop training

        # Each GPU checks on its own whether the loss is NaN
        if torch.isnan(loss):
            print(f"WARNING: nan loss, ending training! loss -> {loss}")
            sys.exit(1)  # if the loss is NaN -> stop training

        # scaler.step() first unscales the gradients back to their original values.
        # If the gradients contain no infs or NaNs, it calls optimizer.step() to update the weights;
        # otherwise the step is skipped, so the weights are not corrupted by an obviously wrong gradient.
        scaler.step(optimizer)
        # scaler.update() adjusts the scale factor dynamically depending on whether this step overflowed.
        scaler.update()
```
The scale factor is estimated dynamically at every iteration. To reduce gradient underflow as much as possible, the scale should be large; but if it is too large, half-precision tensors easily overflow (becoming inf or nan). So the principle of the dynamic estimation is to keep the scale as large as possible without producing inf or nan: every call to scaler.step(optimizer) checks whether any gradient contains inf or NaN:
- If inf or nan appears, scaler.step(optimizer) skips the weight update (it does not call optimizer.step()) and reduces the scale (multiplies it by backoff_factor).
- If no inf or nan appears, the weights are updated normally (optimizer.step() is called); and once a number of consecutive iterations (specified by growth_interval) pass without inf or nan, scaler.update() enlarges the scale (multiplies it by growth_factor).

These knobs correspond to GradScaler's constructor arguments, as the sketch below shows.
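A minimal sketch with the default values documented for PyTorch 1.6+ (written out explicitly only for illustration):

```python
from torch.cuda.amp import GradScaler

# Defaults spelled out; these control the dynamic adjustment described above
scaler = GradScaler(
    init_scale=2.**16,     # initial scale factor
    growth_factor=2.0,     # multiply the scale by this after `growth_interval` stable steps
    backoff_factor=0.5,    # multiply the scale by this when inf/nan gradients are found
    growth_interval=2000,  # number of consecutive non-overflowing steps before growing
)

# During training you can inspect the current scale, e.g. for logging
print(scaler.get_scale())
```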
You can also adopt the DeepVAC project conventions for PyTorch to simplify development: https://github.com/deepVAC/deepvac/.
3. Things to Watch Out For
3.1 The Loss Becomes inf or nan
- You can choose not to use GradScaler at all: just use the autocast() context manager, then call loss.backward() -> optimizer.step().
- Occasional gradient overflow during loss scaling can be ignored, because AMP detects the overflow and skips that update (if you customize the return value of optimizer.step, scaler.step will return None whenever overflow is detected). The scaler automatically reduces the scale for the next iteration, and if updates stay stable for a while, it will try to enlarge the scale again. One way to observe skipped steps is shown in the sketch below.
- If overflow happens constantly and the loss is very unstable, reduce the learning rate (a 10x reduction is a reasonable start); if the loss still fluctuates, the problem probably lies deeper in the network.
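A common pattern (an illustrative sketch, not from the original post) for checking whether a step was skipped is to compare the scale before and after update(): if the scale shrank, the gradients contained inf/nan and the weight update was skipped.

```python
scale_before = scaler.get_scale()
scaler.step(optimizer)   # skipped internally if the unscaled gradients contain inf/nan
scaler.update()
if scaler.get_scale() < scale_before:  # the scale was backed off -> this step was skipped
    print(f"step {step}: optimizer.step() skipped due to inf/nan gradients")
```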
3.2 Training Becomes Slower with AMP
The possible reasons are:

- The overhead of converting between single and half precision. This part is relatively small, and the computation saved by half precision usually more than covers it.
- The scaling and unscaling of values during the backward pass: adding the scaler does slow things down, and this cost can be considerable, because many parameter gradients have to be scaled (extra multiplications and divisions). However, without the scaler, underflow easily occurs during the backward pass (16-bit precision can only represent a limited range, so too-small gradient values lose information), and the final result may be worse. Overall this is a balancing act: trading some time for numerical headroom. The micro-benchmark sketch below can help you check whether AMP pays off on your hardware.
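A rough micro-benchmark (an illustrative sketch, assuming a CUDA device; the matrix size and iteration count are arbitrary) shows whether half-precision matrix multiplication is actually faster on your GPU:

```python
import time

import torch
from torch.cuda.amp import autocast

def bench(use_amp: bool, n_iters: int = 100) -> float:
    a = torch.randn(2048, 2048, device="cuda")
    b = torch.randn(2048, 2048, device="cuda")
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        if use_amp:
            with autocast():
                _ = a @ b   # runs in float16 under autocast
        else:
            _ = a @ b       # plain float32 matmul
    torch.cuda.synchronize()  # wait for all kernels before stopping the clock
    return time.time() - start

print("fp32:", bench(False))
print("amp :", bench(True))
```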
3.3 Recommended Articles
- Speed tests of DP and DDP with amp and GradScaler, using AlexNet as the template
- PyTorch Automatic Mixed Precision (AMP) training
- PyTorch distributed training basics: using DDP
References
- https://zhuanlan.zhihu.com/p/165152789
- https://blog.csdn.net/weixin_44878336/article/details/124501040
- https://blog.csdn.net/weixin_44878336/article/details/124754484
- https://blog.csdn.net/weixin_44878336/article/details/125119242
- https://zhuanlan.zhihu.com/p/516996892