
Reinforcement Learning Weekly (Issue 50): SafeRL-Kit, GMI-DRL, RP-SDRL & Offline Meta-Reinforcement Learning

2022-06-22 22:22:00 Zhiyuan community

About the weekly:

Reinforcement learning is one of the research hotspots in artificial intelligence, and its progress and achievements have attracted wide attention. To help researchers and engineers keep up with progress and news in this field, the Zhiyuan community has curated content from the field to compile Issue 50 of the Reinforcement Learning Weekly. This issue collects the latest paper recommendations and research surveys in reinforcement learning for our readers.

 

The weekly is produced through community collaboration, and friends who are interested are welcome to join the effort and help promote sharing, learning, and exchange within the reinforcement learning community. You can scan the QR code at the end of the article to join the reinforcement learning group.

 

Contributors to this issue: Li Ming, Liu Qing, Little Fat

 

About weekly subscriptions:

Good news: the Reinforcement Learning Weekly now supports subscriptions, so future issues will be pushed to you automatically. To subscribe:

1. Register a Zhiyuan community account.

2. On the weekly's page, click "Reinforcement Learning Weekly" in the author column at the upper left to enter the "Reinforcement Learning Weekly" homepage.

 
 

3. Click "Follow".

 

4. You have now subscribed to the Reinforcement Learning Weekly; the Zhiyuan community will automatically push each new issue to you!

 

Paper recommendations

This issue recommends 14 recent papers in the field of reinforcement learning. Highlights include: a semi-centralized logic-based reward-shaping method that extends reward shaping to multi-agent reinforcement learning (MARL); Bi-DexHands, a bimanual dexterous-hand benchmark and simulator aimed at human-level bimanual dexterity; an exact penalty optimization method for evaluating safe autonomous driving; Bootstrapped Transformer, a new algorithm that uses the bootstrapping idea to drive offline RL training; MBKR, which combines users' micro-behaviors with knowledge graphs (KGs) and reinforcement learning for explainable recommendation; a blockchain-based federated deep actor-critic task-offloading algorithm for secure, low-latency computation offloading; and a GCRN architecture that combines graph convolutional networks (GCNs) for spatial dependencies with bidirectional gated recurrent units (Bi-GRUs) for temporal dependencies; among others.

 

Title: Fast Population-Based Reinforcement Learning on a Single Machine (InstaDeep Ltd: Arthur Flajolet)

Brief introduction: Training a population of agents has shown great promise in reinforcement learning: it stabilizes training, improves exploration and asymptotic performance, and yields diverse solutions. However, practitioners often avoid population-based training because it is perceived as either too slow (when implemented sequentially) or too computationally expensive (when agents are trained in parallel on separate accelerators). This paper compares implementations and revisits prior work to show that judicious use of compilation and vectorization allows population-based training to run on a single machine with one accelerator, at minimal cost compared with training a single agent. The authors also show that, given a handful of accelerators, the protocols scale to large population sizes for applications such as hyperparameter tuning. They hope the public release of the research and code will encourage practitioners to use population-based learning more frequently in their research and applications.
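The core trick can be sketched in a few lines of JAX: stack every agent's parameters along a leading axis and let jax.vmap plus jax.jit turn the whole population update into one compiled call on a single accelerator. The toy linear policy and squared-error loss below are illustrative assumptions, not the paper's code.

```python
import jax
import jax.numpy as jnp

POP_SIZE, OBS_DIM, ACT_DIM = 16, 8, 2

def init_params(key):
    k1, _ = jax.random.split(key)
    return {"w": jax.random.normal(k1, (OBS_DIM, ACT_DIM)) * 0.1,
            "b": jnp.zeros(ACT_DIM)}

def loss_fn(params, obs, target):
    pred = obs @ params["w"] + params["b"]   # toy linear policy
    return jnp.mean((pred - target) ** 2)

@jax.jit  # one compiled step updates the entire population
def population_step(pop_params, obs, target, lr=1e-2):
    grads = jax.vmap(jax.grad(loss_fn), in_axes=(0, None, None))(pop_params, obs, target)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, pop_params, grads)

keys = jax.random.split(jax.random.PRNGKey(0), POP_SIZE)
pop = jax.vmap(init_params)(keys)            # leading axis = population member
pop = population_step(pop, jnp.ones((32, OBS_DIM)), jnp.zeros((32, ACT_DIM)))
```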

Paper link: https://arxiv.org/pdf/2206.08888.pdf


 

Title: Logic-based Reward Shaping for Multi-Agent Reinforcement Learning (University of Virginia: Ingy ElSayed-Aly)

Brief introduction: Reinforcement learning (RL) relies heavily on exploration to learn from its environment and maximize observed reward, so the reward function must be designed to ensure optimal learning from the received experience. Prior work has combined automaton- and logic-based reward shaping with environment assumptions to provide an automatic mechanism for synthesizing the reward function from the task. However, work on extending logic-based reward shaping to multi-agent reinforcement learning (MARL) remains very limited. If the task requires cooperation, the environment needs to track a joint state over the other agents and thus suffers from the curse of dimensionality in the number of agents. This project explores how to design logic-based reward shaping for MARL across different scenarios and tasks. The paper presents a novel semi-centralized logic-based MARL reward-shaping method that scales in the number of agents and evaluates it in multiple scenarios.
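As a toy illustration of automaton-based reward shaping (the states, events, and reward values below are invented for this sketch; the paper's semi-centralized multi-agent construction is more involved), a reward machine advances on symbolic events and pays shaping reward for task progress:

```python
class RewardMachine:
    def __init__(self):
        # states: 0 = start, 1 = agent A pressed the button, 2 = task done
        self.delta = {(0, "button_A"): 1, (1, "door_B"): 2}
        self.state = 0

    def step(self, event):
        """Advance on a symbolic event; return the shaping reward."""
        nxt = self.delta.get((self.state, event))
        if nxt is None:
            return 0.0                       # event irrelevant in this state
        self.state = nxt
        return 1.0 if nxt == 2 else 0.5      # partial credit for progress

rm = RewardMachine()
for ev in ["noop", "button_A", "door_B"]:
    print(ev, rm.step(ev), rm.state)
```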

Paper link: https://arxiv.org/pdf/2206.08881.pdf


 

Title: Towards Human-Level Bimanual Dexterous Manipulation with Reinforcement Learning (Peking University: Yaodong Yang)

Brief introduction: Achieving human-level dexterity is an important open problem in robotics. This paper proposes Bi-DexHands, a bimanual dexterous-manipulation benchmark: a simulator containing two dexterous hands, dozens of bimanual manipulation tasks, and thousands of target objects. The tasks are designed to match different levels of human motor skill. Bi-DexHands is built on Isaac Gym, which enables highly efficient RL training, reaching more than 30,000 frames per second on a single NVIDIA RTX 3090. The authors provide a comprehensive benchmark of popular RL algorithms under different settings, covering single-agent and multi-agent RL, offline RL, multi-task RL, and meta-RL. The results show that on-policy algorithms such as PPO can master simple manipulation tasks comparable to those of a 48-month-old human infant (e.g., catching a flying object, opening a bottle), and that multi-agent RL further helps master manipulations requiring skilled bimanual cooperation (e.g., lifting a pot, stacking blocks). Despite success on individual tasks, existing RL algorithms fail to acquire multiple manipulation skills in most multi-task and few-shot learning settings, which calls for more substantial development from the RL community.
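For reference, here is a minimal sketch of the clipped surrogate objective at the heart of PPO, the on-policy algorithm the benchmark finds strongest on the simpler tasks. The tensors are toy placeholders, and this is the standard textbook form rather than Bi-DexHands code:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, adv, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)                 # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()   # maximise the surrogate

logp_new = torch.randn(64, requires_grad=True)             # toy log-probabilities
loss = ppo_clip_loss(logp_new, logp_new.detach() + 0.1, torch.randn(64))
loss.backward()
```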

Paper link: https://arxiv.org/pdf/2206.08686.pdf


 

Title: SafeRL-Kit: Evaluating Efficient Reinforcement Learning Methods for Safe Autonomous Driving (Tsinghua University Shenzhen Research Institute & JD.com: Xueqian Wang & Li Shen)

Brief introduction: Safe reinforcement learning (RL) has achieved notable success on risk-sensitive tasks and also shows promise for autonomous driving (AD). Given the uniqueness of this community, safe AD still lacks efficient and reproducible baselines. This paper releases SafeRL-Kit, a toolkit for benchmarking safe RL methods on AD tasks. SafeRL-Kit contains several recent algorithms for zero-constraint-violation tasks, including Safety Layer, Recovery RL, off-policy Lagrangian methods, and Feasible Actor-Critic. Beyond existing methods, it also proposes a new first-order method called Exact Penalty Optimization (EPO) and thoroughly validates its capability for safe AD. All algorithms in SafeRL-Kit are (i) implemented in the off-policy setting, which improves sample efficiency and better exploits past logs, and (ii) built on a unified learning framework that gives researchers an off-the-shelf interface for incorporating their domain-specific knowledge into base safe-RL methods. Finally, the algorithms are compared and evaluated within SafeRL-Kit, demonstrating their effectiveness for safe autonomous driving.
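The exact-penalty idea behind EPO can be sketched as a single first-order objective that subtracts a ReLU penalty on predicted constraint violation from the reward term. The critic outputs, kappa, and the cost limit below are placeholder assumptions; see SafeRL-Kit for the actual actor update:

```python
import torch

def epo_actor_loss(reward_q, cost_q, cost_limit, kappa=5.0):
    # Exact penalty: active only when the cost critic predicts a violation
    penalty = torch.relu(cost_q - cost_limit)
    return -(reward_q - kappa * penalty).mean()

reward_q = torch.randn(128)              # reward critic Q(s, pi(s)), toy values
cost_q = torch.rand(128)                 # cost critic Qc(s, pi(s)), toy values
loss = epo_actor_loss(reward_q, cost_q, cost_limit=0.1)
```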

Paper link: https://arxiv.org/pdf/2206.08528.pdf


 

Title: GMI-DRL: Empowering Multi-GPU Deep Reinforcement Learning with GPU Spatial Multiplexing (University of California, Santa Barbara: Yuke Wang)

Brief introduction: With the growing popularity of robotics in industrial control and autonomous driving, deep reinforcement learning (DRL) has attracted attention across many fields. However, due to its heterogeneous workloads and interleaved execution patterns, DRL computation remains inefficient on today's powerful GPU platforms. This paper therefore proposes GMI-DRL, a systematic design for accelerating multi-GPU DRL via GPU spatial multiplexing. It introduces a novel design of resource-adjustable GPU multiplexing instances (GMIs) to match the actual needs of DRL tasks, an adaptive GMI management strategy to achieve high GPU utilization and computation throughput, and efficient inter-GMI communication support for diverse DRL communication patterns. Comprehensive experiments show that on the latest DGX-A100 platform, GMI-DRL outperforms the state-of-the-art NVIDIA Isaac Gym with NCCL (by up to 2.81x) and with Horovod (by up to 2.34x) in training throughput. This work provides an initial user experience of handling heterogeneous workloads that mix computation and communication through GPU spatial multiplexing.

Paper link: https://arxiv.org/pdf/2206.08482.pdf


 

Title: Bootstrapped Transformer for Offline Reinforcement Learning (Shanghai Jiao Tong University: Kerong Wang)

Brief introduction: Offline reinforcement learning (RL) learns policies from previously collected static trajectory data without interacting with the real environment. Recent work offers a novel perspective that treats offline RL as a generic sequence-generation problem, adopting Transformer-style sequence models to model the distribution over trajectories and repurposing beam search as the planning algorithm. However, the training datasets used in typical offline RL tasks are quite limited and often suffer from insufficient distribution coverage, which can harm the training of sequence generation models but has received little attention in prior work. This paper proposes Bootstrapped Transformer, a new algorithm that incorporates the bootstrapping idea, using the learned model to generate additional offline data that further boosts the training of the sequence model. Extensive experiments on two offline RL benchmarks demonstrate that the model remedies the limitations of existing offline RL training and beats other strong baselines. The paper also analyzes the generated pseudo-data, whose revealed characteristics may offer some guidance for offline RL training.
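Schematically, the bootstrapping loop alternates between fitting the sequence model on the dataset and appending self-generated trajectories for the next round. The toy model below merely resamples with noise; in the paper, a trajectory Transformer plays this role:

```python
import random

class ToySequenceModel:
    """Stand-in for the trajectory Transformer: memorises and resamples."""
    def __init__(self):
        self.data = []

    def fit(self, dataset):
        self.data = list(dataset)

    def generate(self, n):
        # resample with noise as a stand-in for model rollouts
        return [t + random.gauss(0, 0.01) for t in random.choices(self.data, k=n)]

def bootstrap_train(model, dataset, rounds=3, n_generated=100):
    for _ in range(rounds):
        model.fit(dataset)                             # standard offline training
        dataset = dataset + model.generate(n_generated)  # augment limited data
    return model, dataset

model, data = bootstrap_train(ToySequenceModel(), [float(i) for i in range(10)])
```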

Paper link: https://arxiv.org/pdf/2206.08569.pdf


 

Title: Micro-behaviour with Reinforcement Knowledge-aware Reasoning for Explainable Recommendation (Donghua University: Shaohua Tao)

Brief introduction: Existing recommendation methods have integrated item knowledge into the micro-behaviors of user-item interactions. Although such methods have proved effective, two insights are often overlooked. First, they do not map micro-behaviors to knowledge-graph (KG) relations, nor capture the semantic relevance between micro-behaviors and relations. Second, they do not provide explicit reasoning over micro-behaviors from user-item interaction data. These insights motivate this paper's novel micro-behavior model, Micro-behaviour with reinforcement Knowledge-aware reasoning for explainable Recommendation (MBKR), which combines micro-behaviors with KGs and reinforcement learning for explainable recommendation. MBKR learns user behavior by propagating through user-item interactions and KG relations, combining the two to compute a behavior intensity that mines user interest. It also designs reasoning over relation paths, unifying recommendation and explainability by providing plausible paths that capture the semantics of both behaviors and relations. Finally, the method is extensively evaluated on several large benchmark datasets.
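As a toy illustration of path-based explainability over a KG (the graph, relation names, and search helper are invented for this sketch; MBKR learns such paths with reinforcement learning rather than by exhaustive search):

```python
from collections import deque

# Tiny invented KG: edges are (relation, target) pairs
kg = {
    "user1": [("clicked", "itemA")],
    "itemA": [("same_brand", "itemB")],
}

def explain_path(kg, start, goal):
    """Breadth-first search for a user-to-item path; the edges double as the explanation."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in kg.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None

print(explain_path(kg, "user1", "itemB"))
# [('user1', 'clicked', 'itemA'), ('itemA', 'same_brand', 'itemB')]
```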

Paper link: https://www.sciencedirect.com/science/article/pii/S0950705122006529


 

Title: Neural H₂ Control Using Continuous-Time Reinforcement Learning (CINVESTAV-IPN: Adolfo Perrusquia)

Brief introduction: This paper studies continuous-time H₂ control of unknown nonlinear systems. A differential neural network is used to model the system, and H₂ tracking control is then designed on the neural model. Because neural H₂ control is very sensitive to neural modeling error, reinforcement learning is used to improve the control performance. The stability of the neural modeling and of the H₂ tracking control is analyzed, and the convergence of the method is given. The effectiveness of the approach is verified on two benchmark control problems.
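For context, H₂ tracking control minimizes a quadratic cost over the tracking error and control effort; the sketch below is the standard form, and the paper's neural, RL-improved variant may use a different exact formulation and weights:

```latex
% Standard quadratic objective of H2 tracking control; e is the tracking error.
J = \int_{0}^{\infty} \left( e^{\top} Q \, e + u^{\top} R \, u \right) \mathrm{d}t ,
\qquad e = x - x_{d}, \quad Q \succeq 0, \; R \succ 0
```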

Paper link: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9269440


 

Title: Residual Physics and Post-Posed Shielding for Safe Deep Reinforcement Learning Method (National University of Singapore: Qingang Zhang)

Brief introduction: This paper studies deep reinforcement learning (DRL) for controlling the air-conditioning units in data center (DC) machine rooms. Two main problems limit DRL deployment in real systems. First, large amounts of data are required. Second, as a mission-critical system, safe control must be guaranteed, with DC temperatures kept within a specified range. To this end, the paper presents a novel control method, RP-SDRL. It first combines residual physics, constructed from the first law of thermodynamics, with the DRL algorithm and a prediction model. Then a correction model adapted from gradient descent is combined with the prediction model as a post-posed shield to enforce safe operation. Simulations validate the RP-SDRL method; noise is added to the model's state to further test performance under state uncertainty. Experimental results show that the method significantly improves the initial policy, sample efficiency, and robustness, and that residual physics also improves the sample efficiency and the accuracy of the prediction model.
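A minimal sketch of the post-posed shield idea, under the assumption of a differentiable prediction model: if the model predicts the proposed action would breach the temperature bound, correct the action by gradient descent on the violation. The linear thermal model and numbers are toy stand-ins, not the paper's model:

```python
import torch

def shield(action, state, predict_temp, t_max=27.0, steps=20, lr=0.1):
    """Correct the proposed action until the predicted temperature is safe."""
    a = action.clone().detach().requires_grad_(True)
    for _ in range(steps):
        violation = torch.relu(predict_temp(state, a) - t_max).sum()
        if violation.item() == 0.0:
            break                          # already within the safe range
        violation.backward()
        with torch.no_grad():
            a -= lr * a.grad               # descend on the violation
        a.grad.zero_()
    return a.detach()

predict_temp = lambda s, a: s.sum() + 3.0 * a.sum()   # toy thermal model
safe_a = shield(torch.tensor([1.0, 2.0]), torch.tensor([10.0, 12.0]), predict_temp)
```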

Paper link: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9796122


 

Title: Blockchain and Federated Deep Reinforcement Learning Based Secure Cloud-Edge-End Collaboration in Power IoT (North China Electric Power University: Sunxuan Zhang)

Brief introduction: Cloud-edge-end collaboration provides harmonious and efficient resource allocation for the power Internet of Things (PIoT), but the security and complexity of computation offloading remain major obstacles. This paper first proposes a blockchain- and AI-based secure cloud-edge-end collaboration (BASE-PIoT) framework to guarantee data security and intelligent computation offloading, and describes its flexible resource allocation, secure data sharing, and differentiated service guarantees. It then analyzes the suitability of three typical blockchains for PIoT and presents some typical BASE-PIoT application scenarios, including computation offloading, energy scheduling, and access authentication. Finally, it proposes a blockchain-based federated deep actor-critic task-offloading algorithm to solve the secure, low-latency computation-offloading problem. Lyapunov optimization is used to decouple the coupling between the long-term security constraint and short-term queuing-delay optimization. Numerical results verify the algorithm's excellent performance in total queuing delay and consensus latency.
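The Lyapunov decoupling can be sketched with a virtual queue that accumulates overspend against the long-term security budget, with each slot's action scored by drift-plus-penalty. The numbers and budget are illustrative assumptions; in the paper, the offloading action itself comes from the federated actor-critic:

```python
def virtual_queue_update(Q, security_cost, budget):
    # Queue grows when the slot's security cost exceeds the long-term budget
    return max(Q + security_cost - budget, 0.0)

def drift_plus_penalty(Q, security_cost, delay, V=10.0):
    # Per-slot score: pick the action minimising drift + V-weighted delay
    return Q * security_cost + V * delay

Q = 0.0
for cost, delay in [(0.3, 2.0), (0.5, 1.0), (0.1, 3.0)]:   # toy slot outcomes
    score = drift_plus_penalty(Q, cost, delay)
    Q = virtual_queue_update(Q, cost, budget=0.25)
    print(f"score={score:.2f}, queue={Q:.2f}")
```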

Paper link: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9801730


 

Title: Robust Task Representations for Offline Meta-Reinforcement Learning via Contrastive Learning (Peking University: Haoqi Yuan | ICML 2022)

Brief introduction: This paper addresses offline meta-reinforcement learning, a practical RL paradigm that learns from offline data to adapt to new tasks. The distribution of the offline data is determined jointly by the behavior policy and the task. Existing offline meta-RL algorithms cannot distinguish these factors, making task representations unstable to changes in behavior policy. To address this, the paper proposes CORRO (COntrastive Robust task Representation learning for OMRL), a contrastive learning framework for task representations that is robust to the mismatch of behavior policies between training and testing. It designs a bi-level encoder structure, formalizes task-representation learning via mutual-information maximization, derives a contrastive learning objective, and introduces several methods to approximate the true distribution of negative pairs. Experiments on a variety of offline meta-RL benchmarks demonstrate the advantages of the method over prior approaches, especially in generalization to out-of-distribution behavior policies.
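The contrastive objective can be sketched as a standard InfoNCE loss over task encodings, where same-task transition pairs are positives and encodings from other tasks (or generated perturbations) serve as negatives. The tensors below are random placeholders standing in for CORRO's encoder outputs:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.1):
    pos = F.cosine_similarity(anchor, positive, dim=-1) / temperature           # (B,)
    neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1) / temperature  # (B, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)    # positive sits at index 0
    return F.cross_entropy(logits, torch.zeros(len(anchor), dtype=torch.long))

z_anchor = torch.randn(32, 16)      # task encodings of anchor transitions
z_pos = torch.randn(32, 16)         # same-task positives
z_neg = torch.randn(32, 8, 16)      # negatives from other tasks / perturbations
loss = info_nce(z_anchor, z_pos, z_neg)
```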

Paper link: https://arxiv.org/pdf/2206.10442.pdf


 

Title: The State of Sparse Training in Deep Reinforcement Learning (Google: Laura Graesser | ICML 2022)

Brief introduction: The use of sparse neural networks has grown rapidly in recent years across many areas of deep learning, especially computer vision. Their appeal stems mainly from the reduced number of parameters required for training and storage, and from improved learning efficiency. Somewhat surprisingly, few efforts have explored their use in deep reinforcement learning (DRL). This paper performs a systematic investigation of applying a number of existing sparse-training techniques to a variety of DRL agents and environments. The results corroborate the findings from sparse training in computer vision: in DRL too, sparse networks outperform dense networks at the same parameter count. The authors analyze in detail how the various components of DRL are affected by the use of sparse networks, and conclude by suggesting promising avenues for improving the effectiveness of sparse training methods and advancing their use in DRL.
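A minimal example of the sparsity primitive such studies build on: magnitude pruning, which keeps the largest weights by absolute value and zeroes the rest. The dense tensor is a toy; the paper evaluates full sparse-training schedules across DRL agents and environments:

```python
import torch

def magnitude_prune(weight, sparsity=0.9):
    """Zero all but the top (1 - sparsity) fraction of weights by magnitude."""
    k = int(weight.numel() * (1 - sparsity))                  # weights to keep
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k).values
    mask = (weight.abs() > threshold).float()
    return weight * mask, mask

w = torch.randn(256, 256)
w_sparse, mask = magnitude_prune(w, sparsity=0.9)
print(f"kept {int(mask.sum())} of {mask.numel()} weights")
```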

Paper link: https://arxiv.org/pdf/2206.10369.pdf


 

Title: Multi-UAV Planning for Cooperative Wildfire Coverage and Tracking with Quality-of-Service Guarantees (Georgia Institute of Technology: Esmaeil Seraj)

Brief introduction: In recent years, teams of robots and unmanned aerial vehicles (UAVs) have been commissioned to provide accurate wildfire coverage and tracking. While much prior work has focused on the coordination and control of such multi-robot systems, these UAV teams have so far lacked the ability to reason about a fire's track (i.e., its location and propagation dynamics) in order to provide performance guarantees over some time horizon. This paper presents a predictive framework that enables multi-UAV teams to cooperate in collaborative wildfire coverage and fire tracking. The approach lets the UAVs infer the latent fire-propagation dynamics for time-extended coordination under safety-critical conditions. A novel set of analytical temporal and tracking-error bounds is derived so that the UAV team can allocate its limited resources and cover the entire fire area according to the estimated state of the specific situation, with probabilistic performance guarantees. The scheme applies generally to search and rescue, target tracking, and border patrol. Quantitative evaluations validate the method: compared with state-of-the-art model-based and reinforcement-learning benchmarks, tracking error is reduced by 7.5x and 9.0x, respectively.

Paper link: https://arxiv.org/pdf/2206.10544.pdf


 

Title: Graph Convolutional Recurrent Networks for Reward Shaping in Reinforcement Learning (Concordia University: Hani Sami)

Brief introduction: To address the slow convergence of reinforcement learning (RL), this paper proposes a new reward-shaping scheme that combines (1) a graph convolutional recurrent network (GCRN), (2) an enhanced Krylov basis, and (3) look-ahead advice to form the potential function. The GCRN architecture combines graph convolutional networks (GCNs) to capture spatial dependencies with bidirectional gated recurrent units (Bi-GRUs) to handle temporal dependencies. The GCRN's loss function is defined using the message-passing technique of hidden Markov models (HMMs). Because the environment's transition matrix is hard to compute, a Krylov basis is used to estimate it, outperforming existing approximate bases. Unlike existing potential functions that rely only on states to perform reward shaping, the scheme uses both states and actions to produce more accurate advice through the look-ahead mechanism. Various tests show that the solution outperforms current state-of-the-art solutions in learning speed while attaining higher rewards.
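The underlying shaping term is the classic potential-based form, extended to state-action potentials to accommodate the look-ahead advice the summary describes. In this sketch a lookup table stands in for the GCRN-learned potential Phi:

```python
def shaped_reward(r, s, a, s_next, a_next, phi, gamma=0.99):
    # F(s, a, s', a') = gamma * Phi(s', a') - Phi(s, a), added to the env reward r
    return r + gamma * phi[(s_next, a_next)] - phi[(s, a)]

phi = {("s0", "a0"): 0.0, ("s1", "a1"): 1.0}   # toy potential table
print(shaped_reward(0.0, "s0", "a0", "s1", "a1", phi))   # 0.99
```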

Paper link: https://www.sciencedirect.com/science/article/pii/S0020025522006442


 

Research reviews

Title: Backbones-Review: Feature Extraction Networks for Deep Learning and Deep Reinforcement Learning Approaches (Qatar University: Omar Elharrouss)

Brief introduction: To make sense of the real world from many types of data, artificial intelligence (AI) is the most commonly used technique today. Finding patterns in the analyzed data is the main task, and selecting useful features from large-scale data is a crucial challenge. With the development of convolutional neural networks (CNNs), feature extraction has become more automatic and easier: CNNs allow the processing of large-scale data and cover different scenarios for specific tasks. In computer-vision tasks, convolutional networks also serve to extract features for the other parts of a deep-learning model. Choosing the right network for feature extraction, or for other parts of a DL model, is not arbitrary: the chosen model should relate to the target task and its computational complexity. Many networks have become famous backbones used in DL models across AI tasks; these are networks previously trained on many other tasks, with proven effectiveness, that can be used for feature extraction or at the start of a DL model (the so-called backbone). This paper reviews existing backbone networks, such as VGG, ResNet, and DenseNet, with detailed descriptions and performance comparisons.

Paper link: https://arxiv.org/pdf/2206.08016.pdf


 

Title: Reinforcement Learning based Recommender Systems: A Survey (University of Calgary: M. Mehdi Afsar)

Brief introduction: Recommender systems (RSs) have become an inseparable part of daily life. Traditionally, a recommendation problem is treated as one of classification or prediction, but it is now widely agreed that formulating it as a sequential decision problem better reflects the user-system interaction. It can therefore be formulated as a Markov decision process (MDP) and solved with reinforcement learning (RL). Unlike traditional recommendation methods, including collaborative filtering and content-based filtering, RL can handle the sequential, dynamic user-system interaction and take long-term user engagement into account. This paper surveys RL-based recommender systems (RLRSs). It first recognizes that RLRSs can generally be divided into RL-based and DRL-based methods, then proposes an RLRS framework with four components, namely state representation, policy optimization, reward formulation, and environment building, and surveys RLRS algorithms accordingly. The survey uses a variety of charts to highlight emerging topics and depict important trends. Finally, it discusses important aspects and challenges to be addressed in the future.

Paper link: https://dl.acm.org/doi/pdf/10.1145/3543846


 

If you are doing or following reinforcement learning research, implementation, or applications, you are welcome to join the "Zhiyuan community - Reinforcement Learning - discussion group". Here, you can:

 

Learn about cutting-edge work and get your questions answered

Share your experience and showcase your talents

Take part in exclusive events and meet research partners

 

Please scan the QR code below to join.

 

 
Copyright notice

This article was created by [Zhiyuan community]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/173/202206222034190080.html