PyTorch deep learning code tips
2022-06-26 14:58:00 【Eight dental shoes】
There are many pitfalls when building a deep learning network, and I have picked up a lot of tricks while reading other people's code. I am recording them here in one place so they are easy to look up.
Parameter configuration
The argparse library
argparse is a library built into Python. With argparse we can set parameters from the command line, much like on a Linux system; the object returned by parse_args() packages all the parameters, which makes it very convenient to pass modified parameters across multiple files.
import argparse

parser = argparse.ArgumentParser(description='MODELname')  # give the argument parser a name
# register_data_args is an external helper (from the DGL examples) that adds dataset-related arguments
register_data_args(parser)
parser.add_argument("--dropout", type=float, default=0.5,
                    help="dropout probability")
parser.add_argument("--gpu", type=int, default=-1,
                    help="gpu")
parser.add_argument("--lr", type=float, default=1e-2,
                    help="learning rate")
parser.add_argument("--n-epochs", type=int, default=200,
                    help="number of training epochs")
parser.add_argument("--n-hidden", type=int, default=16,
                    help="number of hidden gcn units")
parser.add_argument("--n-layers", type=int, default=1,
                    help="number of hidden gcn layers")
parser.add_argument("--weight-decay", type=float, default=5e-4,
                    help="Weight for L2 loss")
parser.add_argument("--aggregator-type", type=str, default="gcn",
                    help="Aggregator type: mean/gcn/pool/lstm")
config = parser.parse_args()
# Then each parameter can be accessed as config.xxx
# When debugging, run at the terminal: python train.py --n-epochs 100 --lr 1e-3 ...
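As a side note (not from the original post), parse_args also accepts an explicit list of arguments, which is handy when debugging in a notebook or a test script where there is no real command line; the values below are only illustrative:
# Minimal sketch: overriding defaults without a real command line
config = parser.parse_args(["--lr", "1e-3", "--n-epochs", "100"])
print(config.lr, config.n_epochs)  # dashes in flag names become underscores in attribute names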
Model framework
DataLoader
- Put the core data-loading work inside __getitem__; the class body (its __init__) should only record the paths of the training data. This lets you make adjustments during training and keeps the CPU memory footprint low, since samples are loaded on demand.
- When overriding __len__(), make sure the returned count matches the number of items that __getitem__ can index.
import numpy as np
from torch.utils.data import Dataset

class TrainDataset(Dataset):
    def __init__(self, listdir=list_dir):
        super(TrainDataset, self).__init__()
        self.train_dirs = []
        for dir in listdir:
            self.train_dirs.append(dir)
        ...

    def __getitem__(self, index):
        path = self.train_dirs[index]
        data = np.load(path)
        ...

    def __len__(self):
        return len(self.train_dirs)
- collate_fn
When constructing the DataLoader object you can set the collate_fn parameter, passing in a data-processing function that you write yourself.
trainloader = DataLoader(dataset=train_dataset, shuffle=True, collate_fn=my_func)
collate_fn customizes how the samples returned by the dataset are combined into a batch.
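For illustration only (a minimal sketch, not from the original post), my_func below pads variable-length sequence samples into one batch; it assumes each dataset item is a (sequence_tensor, label) pair:
import torch
from torch.nn.utils.rnn import pad_sequence

def my_func(batch):
    # batch is a list of (sequence_tensor, label) items returned by __getitem__
    seqs, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs])
    padded = pad_sequence(seqs, batch_first=True)  # pad to the longest sequence in the batch
    return padded, torch.tensor(labels), lengths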
Learning rate
- lr_scheduler
The torch.optim.lr_scheduler module provides several ways to adjust the learning rate based on the number of training epochs. In general, we gradually reduce the learning rate as the epochs go on in order to get better training results.
Common learning rate adjustment strategies:
StepLR: adjusts the learning rate at equal intervals; each adjustment multiplies lr by gamma, and the interval is step_size.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1, verbose=False)
Parameters:
1. optimizer: the optimizer whose learning rate is being scheduled
2. step_size: the adjustment interval; the learning rate is updated every step_size epochs
3. gamma: the multiplicative factor applied to the learning rate
4. last_epoch: the index of the last epoch, from which lr is recovered based on initial_lr. If you resume training after many epochs, set it to the epoch of the loaded model. The default -1 means training from scratch.
5. verbose: whether to print the lr value each time it changes
MultiStepLR: adjusts the learning rate when the current epoch count reaches one of the set values. This method is well suited to later-stage tuning: observe the loss curve and set the learning rate adjustment epochs for each experiment.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=-1, verbose=False)
milestones: a list of epoch indices; each index marks an epoch at which the learning rate is adjusted. The values in the list must be increasing. For example, [20, 50, 100] adjusts the learning rate at epochs 20, 50, and 100.
The other parameters are set in the same way as above.
ExponentialLR: decays the learning rate exponentially, following lr = initial_lr * gamma ** epoch
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma, last_epoch=-1, verbose=False)
CosineAnnealingLR: cosine annealing strategy that adjusts the learning rate periodically
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max, eta_min=0, last_epoch=-1, verbose=False)
Parameters:
1. T_max: the learning rate adjustment period; every period the learning rate returns to its initial value and then decays again. This strategy helps to jump out of saddle points. It is usually set to len(train_dataset).
2. eta_min: the minimum learning rate after decay; the default is 0
3. last_epoch: the index of the last epoch. This variable indicates whether the learning rate needs to be adjusted; the learning rate is adjusted when last_epoch satisfies the set interval. When set to -1, the learning rate is reset to the initial value.
Update the learning rate at each step of training (per batch or per epoch, depending on the scheduler):
scheduler.step()
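For context, here is a minimal training-loop sketch showing where optimizer.step() and scheduler.step() usually go; model, trainloader, and criterion are placeholders, not code from the original post:
optimizer = torch.optim.SGD(model.parameters(), lr=config.lr,
                            weight_decay=config.weight_decay)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(config.n_epochs):
    for x, y in trainloader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()  # StepLR is usually stepped once per epoch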
- warmup
Warmup is a learning rate optimization technique that first appeared in the ResNet paper. At the start of training a small learning rate is used; after a warmup period (around 10 epochs or 10,000 steps), training switches to the preset learning rate.
Why use warmup:
At the beginning of training the weights are randomly initialized and the model knows nothing about the data. During the first epochs the model adjusts its parameters rapidly based on the input; if a large learning rate is used at this point, the model is very likely to go astray, and it takes many more rounds to pull it back.
After the model has trained for a while, it has some prior knowledge of the data, so a larger learning rate is less likely to lead it to learn biased representations and can be used to speed up training.
Once the model has trained with a large learning rate for some time, its distribution is relatively stable and it should not chase new features in the data too aggressively. Continuing with a large learning rate would destroy that stability, so a smaller learning rate becomes the better choice.
warmup implementation:
from typing import Union

import torch
from torch.optim.lr_scheduler import _LRScheduler
from typeguard import check_argument_types  # used by the original implementation


class WarmupLR(_LRScheduler):
    """The WarmupLR scheduler.

    This scheduler is almost the same as the NoamLR scheduler except for the
    following difference:
        NoamLR:   lr = optimizer.lr * model_size ** -0.5
                       * min(step ** -0.5, step * warmup_step ** -1.5)
        WarmupLR: lr = optimizer.lr * warmup_step ** 0.5
                       * min(step ** -0.5, step * warmup_step ** -1.5)

    Note that the maximum lr equals optimizer.lr in this scheduler.
    """

    def __init__(
        self,
        optimizer: torch.optim.Optimizer,
        warmup_steps: Union[int, float] = 25000,
        last_epoch: int = -1,
    ):
        assert check_argument_types()
        self.warmup_steps = warmup_steps

        # __init__() must be invoked before setting field
        # because step() is also invoked in __init__()
        super().__init__(optimizer, last_epoch)

    def __repr__(self):
        return f"{self.__class__.__name__}(warmup_steps={self.warmup_steps})"

    def get_lr(self):
        step_num = self.last_epoch + 1
        return [
            lr
            * self.warmup_steps ** 0.5
            * min(step_num ** -0.5, step_num * self.warmup_steps ** -1.5)
            for lr in self.base_lrs
        ]

    def set_step(self, step: int):
        self.last_epoch = step
Use warm_up together with an lr_scheduler: first ramp the learning rate up (e.g., linearly), then let it decay. A simple pattern is to only start stepping the decay scheduler after the warmup phase:
if step_counter > 10:
    scheduler.step()
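A concrete way to get this "ramp up, then decay" behavior without writing a custom class is LambdaLR; the sketch below uses made-up values for warmup_steps and total_steps and is not the original author's code:
import math
import torch

warmup_steps, total_steps = 1000, 100000  # assumed values

def lr_lambda(step):
    if step < warmup_steps:
        return float(step + 1) / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# call scheduler.step() once per training step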
Model training
Distributed
- nn.DataParallel
With torch.nn.DataParallel you can do parallel training on a single host that has more than one GPU: during training each batch is split across the GPUs, each GPU computes in parallel, the losses are gathered and averaged, and the back-propagated gradients are distributed back to each GPU.
model= MY_MODEL()
if torch.cuda.device_count() > 1:
model = nn.DataParallel(model)
Be careful: when DataParallel is enabled, avoid relying on next(model.parameters()) to pick a device, and if the model contains external sub-models, do not pin those sub-models to a specific device index (cuda:0). Use torch.device('cuda') instead and let the program allocate devices itself; the same goes for the data, otherwise the model and the data it operates on can end up on different GPUs.
- Check which device the model and the data are on
# data type:torch.tensor
print(next(model.parameters()).device)
print(data.device)
Check whether the data is on the GPU:
print(data.is_cuda)
Move data between GPU and CPU:
# simplest form; data is a torch.Tensor (these calls return a new tensor)
data = data.cpu()
data = data.cuda()
# or set up a device explicitly
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
data = data.to(device)
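Putting the above together, the usual pattern is to create one device object and move both the model and every batch to it (a sketch; model and trainloader are placeholders):
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
for x, y in trainloader:
    x, y = x.to(device), y.to(device)  # .to() returns a new tensor, so reassign
    out = model(x)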
- local_rank
In PyTorch distributed training, local_rank is the GPU index used by a process within its node. It is not an explicit parameter you set yourself; it is assigned internally by torch.distributed.launch. The host GPU has local_rank 0. For example, rank = 3, local_rank = 0 means the 1st GPU inside the 3rd process; the host process has local_rank = 0.
In each iteration, every process has its own optimizer and completes all the optimization steps independently; training inside a process looks the same as ordinary training.
After each process finishes its gradient computation, the gradients are aggregated and averaged, and the rank = 0 process broadcasts the result to all processes. Each process then uses this gradient to update its parameters.
Set which GPU this process works on:
torch.cuda.set_device(arg.local_rank)
device = torch.device('cuda',arg.local_rank)
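For reference, a minimal DistributedDataParallel sketch (assuming the script is started with torch.distributed.launch or torchrun, and that the parsed command-line arguments are stored in args); this is an illustrative setup, not the original post's code:
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend='nccl')        # one process per GPU
torch.cuda.set_device(args.local_rank)
device = torch.device('cuda', args.local_rank)

model = MY_MODEL().to(device)
model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)

# DistributedSampler gives each process a different shard of the dataset
sampler = DistributedSampler(train_dataset)
trainloader = DataLoader(train_dataset, batch_size=32, sampler=sampler)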
Debugging tips and data visualization
- Add a check for whether the dataset has been downloaded; if something is wrong, raise a RuntimeError
if not self.check_integrity():
    raise RuntimeError('Dataset not found or corrupted.'
                       ' You need to download it from official website.')

# Check if the path to the dataset exists
def check_integrity(self):
    if not os.path.exists(self.root_dir):
        return False
    else:
        return True
- Use enumerate with a dict comprehension to build word2idx / label2idx mappings, so you don't need a separate counter variable incremented with += 1 inside a loop
self.label2index = {label: index for index, label in enumerate(sorted(set(labels)))}
- Record the total number of parameters of the model
# numel() in PyTorch counts the total number of elements in a tensor
num_params = sum(p.numel() for p in model.parameters())
print(num_params)
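A small variation (not in the original post) counts only the trainable parameters:
num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {num_trainable}")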
Training log
- The logging module
Use the logging module to record the training log: it writes log files in the background and can also print to the console, which makes it easy to monitor training information.
import logging
import os

# File path, file name, and logging level
logging.basicConfig(filename=os.path.join(self.writer.log_dir, 'training.log'),
                    level=logging.DEBUG,
                    format='%(asctime)s - %(name)s - %(message)s')

logging.debug('this is debug message')    # this message is recorded in training.log
logging.info('this is info message')
logging.warning('this is warning message')

# Result:
# 2017-08-23 14:22:25,713 - root - this is debug message
# 2017-08-23 14:22:25,713 - root - this is info message
# 2017-08-23 14:22:25,714 - root - this is warning message
Parameters of logging.basicConfig:
filename: specifies the log file name
filemode: same meaning as the mode argument of open(); specifies how the log file is opened, 'w' or 'a'
format: specifies the format and content of the output; format can include a lot of useful information, as shown in the example above:
%(levelno)s: Print log level values
%(levelname)s: Print log level name
%(pathname)s: Print the path of the currently executing program , In fact, that is sys.argv[0]
%(filename)s: Print the name of the currently executing program
%(funcName)s: Print the current function of the log
%(lineno)d: Print the current line number of the log
%(asctime)s: Time to print the log
%(thread)d: Print thread ID
%(threadName)s: Print thread name
%(process)d: Printing process ID
%(message)s: Print log information
datefmt: specifies the time format, same as time.strftime()
level: sets the logging level; the default is logging.WARNING
stream: specifies the output stream for the log; it can be sys.stderr, sys.stdout, or a file. The default is sys.stderr. When both stream and filename are specified, stream is ignored.
- Output the log to both a file and the screen:
# Set the logger name; it is usually named after the main module
logger1 = logging.getLogger(__name__)
logging.basicConfig(filename=os.path.join(self.writer.log_dir, 'training.log'),
                    level=logging.DEBUG)
logging.info(f"Start SimCLR training for {self.args.epochs} epochs.")
logging.info(f"Training with gpu: {self.args.disable_cuda}.")
....
# Handlers that write the log to files
fh1 = logging.FileHandler(filename='a1.log', encoding='utf-8')  # file a1
fh2 = logging.FileHandler(filename='a2.log', encoding='utf-8')  # file a2
sh = logging.StreamHandler()  # handler that writes the log to the terminal (screen)
# Attach the handlers so output goes to both the files and the screen
logger1.addHandler(fh1)
logger1.addHandler(fh2)
logger1.addHandler(sh)