New Google Brain research: how does reinforcement learning learn to observe with sound?
2022-06-24 02:59:00 【AI Technology Review】
Compiled by Wang Ye
Proofread by Victor
Humans have shown that the nervous system in the brain can change its structure to keep adapting to changes in the external environment: synapses and the connections between neurons can form new links as a result of learning and experience.
Correspondingly, sensory substitution is also part of the human skill set. For example, some people who are blind from birth can learn to perceive the outline and shape of the human body by converting images into sound.
If AI were given this ability, it could, like bats and dolphins, use sound and echoes to "see" the world around it.
Recently, a paper from Google Brain titled "The Sensory Neuron as a Transformer: Permutation-Invariant Neural Networks for Reinforcement Learning" demonstrated that reinforcement learning can acquire this kind of "sensory substitution".
Paper address: https://arxiv.org/pdf/2109.02869.pdf
Specifically, the authors design a family of reinforcement learning systems in which each sensory input from the environment is fed into a different, loosely connected neural network module; notably, there is no fixed relationship among these modules. The study shows that these sensory networks can be trained to integrate locally received information and, by communicating through an attention mechanism, collectively reach a globally coherent consensus.
Moreover, even when the input order is randomly shuffled multiple times within a single episode, the system can still perform its task.
1
Demonstration
Modern deep learning systems usually cannot adapt to a random reordering of their sensory inputs unless the model is retrained or the user manually restores the input order. However, meta-learning techniques can help a model adapt to such changes; examples include adaptive weights, Hebbian learning, and model-based methods.
In the paper, the agents studied all share one property: the sensory inputs they process while performing a task can be suddenly and randomly reordered. Inspired by recent work on self-organizing neural networks related to cellular automata, the authors feed each sensory input (a single state variable in a continuous-control environment, or a patch of pixels in a visual environment) into a separate neural network module, which integrates information from that specific sensory channel over time.
While receiving information locally, these individual sensory modules also continually broadcast output messages. Drawing on the Set Transformer framework, an attention mechanism combines these messages into a global latent code, which is then mapped to the agent's action space. The attention mechanism can be viewed as a form of adaptive weighting that, in this case, allows any number of sensory inputs to be processed in any random order, as sketched below.
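To make this concrete, here is a minimal NumPy sketch of that kind of attention pooling, using a fixed set of learned queries in the spirit of the Set Transformer the paper draws on. The function names, dimensions, and random weights are illustrative stand-ins rather than the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(messages, W_q, W_k, W_v):
    """Pool an unordered, variable-length set of sensory messages into a
    fixed-size global latent code via cross-attention.

    messages: (n, d_msg) -- one broadcast message per sensory module
    W_q:      (m, d_k)   -- m fixed learned queries, independent of input order
    W_k:      (d_msg, d_k)
    W_v:      (d_msg, d_v)
    returns:  (m, d_v)   -- the same code for any permutation of the rows
    """
    K = messages @ W_k                             # (n, d_k)
    V = messages @ W_v                             # (n, d_v)
    scores = W_q @ K.T / np.sqrt(K.shape[-1])      # (m, n)
    weights = softmax(scores, axis=-1)             # attend over the unordered set
    return weights @ V                             # weighted sum over the set

# Shuffling the rows of `messages` leaves the pooled code unchanged.
rng = np.random.default_rng(0)
msgs = rng.normal(size=(5, 8))                     # 5 sensory messages of size 8
W_q = rng.normal(size=(4, 16))                     # 4 learned queries
W_k = rng.normal(size=(8, 16))
W_v = rng.normal(size=(8, 16))
code_a = attention_pool(msgs, W_q, W_k, W_v)
code_b = attention_pool(msgs[rng.permutation(5)], W_q, W_k, W_v)
print(np.allclose(code_a, code_b))                 # True
```

Because the pooled code is just an attention-weighted sum over the set of messages, permuting the messages permutes the weights along with them, so the result, and hence the policy built on top of it, is unaffected by the input order.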
In the experiments, the authors found that even though each individual sensory module receives only local information, the modules can still cooperate to produce a globally coherent policy, and such systems can be trained to perform tasks in several popular reinforcement learning (RL) environments. Moreover, the system can use a varying number of sensory input channels in any random order, even when the order is reshuffled again within a single episode.
As pictured above, the Pong agent keeps working even when it is given only a small, reshuffled subset (30%) of the screen.
On the other hand, encouraging the system to learn a permutation-invariant, coherent representation of the observation space makes its policies more robust and better at generalizing. The study shows that, without extra training, the system keeps operating even when extra input channels containing noise or redundant information are added. In visual environments, the system can be trained on only a small number of randomly selected patches of the screen, and if it is given more patches at test time, it can use the additional information to perform better.
The authors also show that although the system is trained on a single fixed background, it generalizes to visual environments with different background images. Finally, to make training more practical, they propose a behavioral cloning scheme that converts a policy trained with existing methods into a permutation-invariant policy with the desired properties, along the lines of the sketch below.
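As a rough illustration of that idea (not the paper's training code), the toy sketch below regresses a permutation-invariant "student" onto actions logged from an already-trained teacher, shuffling the input channels of every sample so that the student cannot rely on channel position. The student model, the logged data, and the loss are placeholder stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy permutation-invariant "student": mean-pool over input channels, then a linear head.
W = rng.normal(size=(4, 2)) * 0.1                  # (feature_dim, action_dim)

def student(obs):                                  # obs: (n_channels, feature_dim)
    pooled = obs.mean(axis=0)                      # channel order does not matter
    return pooled @ W

# Stand-ins for observations and actions logged while running a pre-trained teacher.
teacher_obs = rng.normal(size=(32, 5, 4))          # 32 transitions, 5 channels, 4 features each
teacher_act = rng.normal(size=(32, 2))             # the teacher's actions on those observations

def bc_loss(obs_batch, act_batch):
    """Mean-squared imitation loss with per-sample channel shuffling."""
    total = 0.0
    for obs, act in zip(obs_batch, act_batch):
        shuffled = obs[rng.permutation(obs.shape[0])]      # break any fixed channel order
        total += np.mean((student(shuffled) - act) ** 2)   # imitate the teacher's action
    return total / len(obs_batch)

print(bc_loss(teacher_obs, teacher_act))
```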
Figure caption: Method overview
In the image above, the AttentionNeuron is a standalone layer in which each sensory neuron has access to only part of the unordered observations. Combining that local part with the agent's previous action, each neuron applies a shared function and generates its message independently.
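Below is a minimal sketch, under the same illustrative assumptions as the earlier snippet, of such a shared per-neuron function: every sensory neuron sees one component of the shuffled observation plus the agent's previous action and applies the same weights to produce its message. The dimensions and the tanh nonlinearity are placeholders, not the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared weights: every sensory neuron uses the *same* function, so no neuron
# can rely on which input channel happens to be routed to it.
d_obs, d_act, d_msg = 1, 1, 8
W_shared = rng.normal(size=(d_obs + d_act, d_msg)) * 0.1

def sensory_neuron(obs_component, prev_action):
    """Combine one local observation component with the agent's previous action
    and emit a message using the shared weights."""
    x = np.concatenate([np.atleast_1d(obs_component), np.atleast_1d(prev_action)])
    return np.tanh(x @ W_shared)                   # (d_msg,)

# Apply the same shared function to each channel of an unordered observation.
observation = rng.normal(size=5)                   # e.g. 5 shuffled state variables
prev_action = 0.0
messages = np.stack([sensory_neuron(o, prev_action) for o in observation])
print(messages.shape)                              # (5, 8): one message per sensory neuron
```

Messages produced this way form exactly the kind of unordered set that the attention pooling sketched earlier aggregates into the global latent code.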
Table caption: List of symbols
The table above also gives the model's dimensions for the different reinforcement learning environments, so that readers can understand each part of the system.
Figure caption: Permutation-invariant agent in CartPoleSwingUpHarder
In the demonstration above, users can rearrange the order of the 5 inputs at any time and watch how the agent adapts to the new input order.
Demo address: https://attentionneuron.github.io/
Table caption: Cart-pole test results
The authors report the mean score and standard deviation over 1,000 test episodes for each experiment. The agent was trained only in an environment with 5 sensory inputs.
Figure caption: Permutation-invariant output
Whether the sensor array is fed in as-is (top) or randomly shuffled (bottom), the output of the AttentionNeuron layer (the 16-dimensional global latent code) does not change. Yellow indicates higher values, blue lower values.
Figure caption: Handling an unspecified number of additional noise channels
Without extra training, the agent receives 15 input signals in shuffled order, of which 10 are pure Gaussian noise (σ = 0.1) and the other 5 are the actual observations from the environment. As in the previous demonstration, the user can reshuffle the order of the 15 inputs and watch how the agent adapts to the new order.
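A minimal sketch of how such a test input could be assembled is shown below; the CartPole-like state values are placeholders, and the helper name is made up for illustration.

```python
import numpy as np

def add_noise_channels(real_obs, n_noise=10, sigma=0.1, rng=None):
    """Pad a real observation (e.g. the 5 CartPole state variables) with pure
    Gaussian noise channels and shuffle everything together, mimicking the
    5-real + 10-noise setup in the demo."""
    rng = rng or np.random.default_rng()
    noise = rng.normal(scale=sigma, size=n_noise)          # pure noise channels
    channels = np.concatenate([real_obs, noise])           # 5 + 10 = 15 channels
    return channels[rng.permutation(channels.size)]        # unordered observation

obs = np.array([0.01, -0.02, 0.3, 0.0, 0.05])              # placeholder state values
print(add_noise_channels(obs))                             # 15 shuffled values for the agent
```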
Figure caption: Two-dimensional embedding of the AttentionNeuron layer's outputs during testing
The authors highlight several representative clusters in the figure and show their sampled inputs. For each cluster, 3 corresponding inputs are shown (rows), and each input is unstacked to show the time dimension (columns).
The base CarRacing task (left) and the modified shuffled-screen task (right).
The authors' agent is trained only in this environment. As shown above, the screen on the right is what the agent observes, while the one on the left is what a human would see. Humans would find it very difficult to drive with their observations rearranged like this, because we are rarely exposed to such tasks, much like the "backwards bicycle" example mentioned earlier.
2
Discussion and future work
In this work, the authors study deep learning agents whose observations can be treated as an arbitrarily ordered, variable-length list of sensory inputs. Each input stream is processed independently, and attention is used to integrate the processed information. Even if the order of the observations is randomly changed multiple times within an episode, and without any retraining, the agents can still perform their tasks. The performance comparison for each environment is reported in the table below.
Table caption: Reshuffling observations in the tasks considered in this work
Within each episode, the authors reshuffle the observation order every t steps. The CartPole task has high variance, so it was tested 1,000 times; for the other tasks, the mean and standard deviation over 100 tests are reported. Except for Atari Pong, all environments have a hard limit of 1,000 steps per episode. In Atari Pong, although there is no maximum episode length, each episode was observed to last about 2,500 steps.
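The sketch below illustrates this evaluation protocol with a dummy environment and policy; the environment, the policy, and the reshuffling interval t = 100 are placeholders, since the text does not specify t.

```python
import numpy as np

rng = np.random.default_rng(0)

class DummyEnv:
    """Stand-in environment with a 5-dimensional observation (illustration only)."""
    def reset(self):
        return rng.normal(size=5)
    def step(self, action):
        return rng.normal(size=5), 0.0, False      # obs, reward, done

def evaluate_with_reshuffling(env, policy, episode_len=1000, t=100):
    """Run one episode, drawing a fresh random permutation of the observation
    channels every `t` steps."""
    obs = env.reset()
    perm = rng.permutation(obs.size)
    total_reward = 0.0
    for step in range(episode_len):
        if step % t == 0:
            perm = rng.permutation(obs.size)       # re-shuffle the channel order
        action = policy(obs[perm])                 # the policy only sees shuffled inputs
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

print(evaluate_with_reshuffling(DummyEnv(), policy=lambda o: o.mean()))
```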
Shuffling the agent's observations, even when those observations are incomplete, forces it to interpret the meaning of each local sensory input and its relationship to the global whole. This has practical value in many current applications: for example, when applied to robots, it can avoid errors caused by cross-wiring or complex, changing input-output mappings. In a setup similar to the CartPole experiment with additional noise channels, a system receiving thousands of noisy input channels can identify the small subset of channels that carry relevant information.
One limitation is that, in visual environments, the choice of patch size affects both performance and computational complexity. The authors found that a 6x6-pixel patch size works very well for the tasks considered, a 4x4-pixel patch size works to some extent, but single-pixel observations do not work. A small patch size also produces a large attention matrix, which can be too computationally expensive unless approximations are used.
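For reference, a small sketch of how a frame could be cut into non-overlapping 6x6 patches is given below; the frame size and reshaping details are illustrative rather than taken from the paper's code. The number of patches produced here is also the size of the set the attention layer must attend over, which is why smaller patches raise the computational cost.

```python
import numpy as np

def to_patches(frame, patch=6):
    """Split an HxWxC frame into non-overlapping patch x patch blocks; each
    block becomes one sensory input channel."""
    h, w, c = frame.shape
    h, w = h - h % patch, w - w % patch            # crop so the frame divides evenly
    blocks = frame[:h, :w].reshape(h // patch, patch, w // patch, patch, c)
    return blocks.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

frame = np.zeros((96, 96, 3))                      # e.g. a 96x96 RGB game frame
print(to_patches(frame, patch=6).shape)            # (256, 108): 16x16 patches of 6x6x3
```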
Another limitation is that the permutation-invariance property applies only to the inputs, not to the outputs. While the order of the observations can be reshuffled, the order of the actions cannot. For permutation-invariant outputs to work, each output unit would need feedback from the environment, including reward information, in order to learn its relationship to the environment.
An interesting direction for future research is to give the action layer the same property, modeling each motor neuron as a module connected through attention. With the authors' method, it may be possible to train an agent with an arbitrary number of components, or to use a single policy, given a reward signal as feedback, to control robots with different morphologies. In addition, the method in this work takes the previous action as the feedback signal, but the feedback signal is not limited to actions. The authors say they look forward to future work that incorporates signals such as environmental rewards, producing agents that can adapt not only to changes in the observed environment but also to changes in themselves, with fixed components replaced by trained meta-learning agents.