New Google Brain research: how does reinforcement learning learn to observe with sound?
2022-06-24 02:59:00 【AI Technology Review】
Compiled by Wang Ye
Proofread by Victor
Humans have shown that the nervous system in the brain can change its structure to keep adapting to changes in the external environment: synapses and the connections between neurons can form new links as a result of learning and experience.
Correspondingly, sensory substitution is also part of the human skill set. For example, some people who are blind from birth can learn to perceive the outline and shape of the human body by converting images into sound.
If AI were given this ability, it could, like bats and dolphins, use sound and echoes to "see" the world around it.
Recently, a paper from Google Brain titled "The Sensory Neuron as a Transformer: Permutation-Invariant Neural Networks for Reinforcement Learning" demonstrated that reinforcement learning can acquire this kind of "sensory substitution".
Paper address: https://arxiv.org/pdf/2109.02869.pdf
Specifically, the authors design a family of reinforcement learning systems in which each sensory input from the environment is fed into a different, loosely connected neural network module; notably, there is no fixed relationship among these modules. The study shows that these sensory networks can be trained to integrate locally received information and, by communicating through an attention mechanism, collectively reach a globally coherent consensus.
Moreover, even when the input order is randomly shuffled multiple times within a single episode, the system can still perform its task.
1
Demonstration
Modern deep learning systems usually cannot adapt to a random reordering of their sensory inputs unless the model is retrained or the user manually restores the input order. However, meta-learning techniques can help a model adapt to such changes; examples include adaptive weights, Hebbian learning, and model-based methods.
In the paper, the agents studied all share one property: the sensory inputs they process while performing a task can be suddenly and randomly reordered. Inspired by recent work on self-organizing neural networks related to cellular automata, the authors feed each sensory input (a single state variable in a continuous-control environment, or a patch of pixels in a visual environment) into a separate neural network module, which integrates information from that specific sensory channel over time.
While receiving information locally, these individual sensory modules also continually broadcast output messages. Drawing on the Set Transformer framework, an attention mechanism combines these messages into a global latent code, which is then mapped to the agent's action space. The attention mechanism can be viewed as a form of adaptive weighting that, in this case, allows any number of sensory inputs to be processed in any random order, as sketched below.
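To make this concrete, here is a minimal NumPy sketch of that kind of attention pooling, using a fixed set of learned queries in the spirit of the Set Transformer the paper draws on. The function names, dimensions, and random weights are illustrative stand-ins rather than the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(messages, W_q, W_k, W_v):
    """Pool an unordered, variable-length set of sensory messages into a
    fixed-size global latent code via cross-attention.

    messages: (n, d_msg) -- one broadcast message per sensory module
    W_q:      (m, d_k)   -- m fixed learned queries, independent of input order
    W_k:      (d_msg, d_k)
    W_v:      (d_msg, d_v)
    returns:  (m, d_v)   -- the same code for any permutation of the rows
    """
    K = messages @ W_k                             # (n, d_k)
    V = messages @ W_v                             # (n, d_v)
    scores = W_q @ K.T / np.sqrt(K.shape[-1])      # (m, n)
    weights = softmax(scores, axis=-1)             # attend over the unordered set
    return weights @ V                             # weighted sum over the set

# Shuffling the rows of `messages` leaves the pooled code unchanged.
rng = np.random.default_rng(0)
msgs = rng.normal(size=(5, 8))                     # 5 sensory messages of size 8
W_q = rng.normal(size=(4, 16))                     # 4 learned queries
W_k = rng.normal(size=(8, 16))
W_v = rng.normal(size=(8, 16))
code_a = attention_pool(msgs, W_q, W_k, W_v)
code_b = attention_pool(msgs[rng.permutation(5)], W_q, W_k, W_v)
print(np.allclose(code_a, code_b))                 # True
```

Because the pooled code is just an attention-weighted sum over the set of messages, permuting the messages permutes the weights along with them, so the result, and hence the policy built on top of it, is unaffected by the input order.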
In the experiments, the authors found that even though each individual sensory module receives only local information, the modules can still cooperate to produce a globally coherent policy, and such systems can be trained to perform tasks in several popular reinforcement learning (RL) environments. Moreover, the system can use a varying number of sensory input channels in any random order, even when the order is reshuffled again within a single episode.
As pictured above, the Pong agent keeps working even when it is given only a small, reshuffled subset (30%) of the screen.
On the other hand, encouraging the system to learn a permutation-invariant, coherent representation of the observation space makes its policies more robust and better at generalizing. The study shows that, without extra training, the system keeps operating even when extra input channels containing noise or redundant information are added. In visual environments, the system can be trained on only a small number of randomly selected patches of the screen, and if it is given more patches at test time, it can use the additional information to perform better.
The authors also show that although the system is trained on a single fixed background, it generalizes to visual environments with different background images. Finally, to make training more practical, they propose a behavioral cloning scheme that converts a policy trained with existing methods into a permutation-invariant policy with the desired properties, along the lines of the sketch below.
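As a rough illustration of that idea (not the paper's training code), the toy sketch below regresses a permutation-invariant "student" onto actions logged from an already-trained teacher, shuffling the input channels of every sample so that the student cannot rely on channel position. The student model, the logged data, and the loss are placeholder stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy permutation-invariant "student": mean-pool over input channels, then a linear head.
W = rng.normal(size=(4, 2)) * 0.1                  # (feature_dim, action_dim)

def student(obs):                                  # obs: (n_channels, feature_dim)
    pooled = obs.mean(axis=0)                      # channel order does not matter
    return pooled @ W

# Stand-ins for observations and actions logged while running a pre-trained teacher.
teacher_obs = rng.normal(size=(32, 5, 4))          # 32 transitions, 5 channels, 4 features each
teacher_act = rng.normal(size=(32, 2))             # the teacher's actions on those observations

def bc_loss(obs_batch, act_batch):
    """Mean-squared imitation loss with per-sample channel shuffling."""
    total = 0.0
    for obs, act in zip(obs_batch, act_batch):
        shuffled = obs[rng.permutation(obs.shape[0])]      # break any fixed channel order
        total += np.mean((student(shuffled) - act) ** 2)   # imitate the teacher's action
    return total / len(obs_batch)

print(bc_loss(teacher_obs, teacher_act))
```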
Figure caption: Method overview
In the image above, the AttentionNeuron is a standalone layer in which each sensory neuron has access to only part of the unordered observations. Combining that local part with the agent's previous action, each neuron applies a shared function and generates its message independently.
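Below is a minimal sketch, under the same illustrative assumptions as the earlier snippet, of such a shared per-neuron function: every sensory neuron sees one component of the shuffled observation plus the agent's previous action and applies the same weights to produce its message. The dimensions and the tanh nonlinearity are placeholders, not the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared weights: every sensory neuron uses the *same* function, so no neuron
# can rely on which input channel happens to be routed to it.
d_obs, d_act, d_msg = 1, 1, 8
W_shared = rng.normal(size=(d_obs + d_act, d_msg)) * 0.1

def sensory_neuron(obs_component, prev_action):
    """Combine one local observation component with the agent's previous action
    and emit a message using the shared weights."""
    x = np.concatenate([np.atleast_1d(obs_component), np.atleast_1d(prev_action)])
    return np.tanh(x @ W_shared)                   # (d_msg,)

# Apply the same shared function to each channel of an unordered observation.
observation = rng.normal(size=5)                   # e.g. 5 shuffled state variables
prev_action = 0.0
messages = np.stack([sensory_neuron(o, prev_action) for o in observation])
print(messages.shape)                              # (5, 8): one message per sensory neuron
```

Messages produced this way form exactly the kind of unordered set that the attention pooling sketched earlier aggregates into the global latent code.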
Table caption: List of symbols
The table above also gives the model's dimensions for the different reinforcement learning environments, so that readers can understand each part of the system.
Figure caption: Permutation-invariant agent in CartPoleSwingUpHarder
In the demonstration above, users can rearrange the order of the 5 inputs at any time and watch how the agent adapts to the new input order.
Demo address: https://attentionneuron.github.io/
Table caption: Cart-pole test results
The authors report the mean score and standard deviation over 1,000 test episodes for each experiment. The agent was trained only in an environment with 5 sensory inputs.
Figure caption: Permutation-invariant output
Whether the sensor array is fed in as-is (top) or randomly shuffled (bottom), the output of the AttentionNeuron layer (the 16-dimensional global latent code) does not change. Yellow indicates higher values, blue lower values.
Figure caption: Handling an unspecified number of additional noise channels
Without extra training, the agent receives 15 input signals in shuffled order, of which 10 are pure Gaussian noise (σ = 0.1) and the other 5 are the actual observations from the environment. As in the previous demonstration, the user can reshuffle the order of the 15 inputs and watch how the agent adapts to the new order.
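A minimal sketch of how such a test input could be assembled is shown below; the CartPole-like state values are placeholders, and the helper name is made up for illustration.

```python
import numpy as np

def add_noise_channels(real_obs, n_noise=10, sigma=0.1, rng=None):
    """Pad a real observation (e.g. the 5 CartPole state variables) with pure
    Gaussian noise channels and shuffle everything together, mimicking the
    5-real + 10-noise setup in the demo."""
    rng = rng or np.random.default_rng()
    noise = rng.normal(scale=sigma, size=n_noise)          # pure noise channels
    channels = np.concatenate([real_obs, noise])           # 5 + 10 = 15 channels
    return channels[rng.permutation(channels.size)]        # unordered observation

obs = np.array([0.01, -0.02, 0.3, 0.0, 0.05])              # placeholder state values
print(add_noise_channels(obs))                             # 15 shuffled values for the agent
```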
Figure caption: Two-dimensional embedding of the AttentionNeuron layer's outputs during testing
The authors highlight several representative clusters in the figure and show their sampled inputs. For each cluster, 3 corresponding inputs are shown (rows), and each input is unstacked to show the time dimension (columns).
The base CarRacing task (left) and the modified shuffled-screen task (right).
The authors' agent is trained only in this environment. As shown above, the screen on the right is what the agent observes, while the one on the left is what a human would see. Humans would find it very difficult to drive with their observations rearranged like this, because we are rarely exposed to such tasks, much like the "backwards bicycle" example mentioned earlier.
2
Discussion and future work
In this work, the authors study deep learning agents whose observations can be treated as an arbitrarily ordered, variable-length list of sensory inputs. Each input stream is processed independently, and attention is used to integrate the processed information. Even if the order of the observations is randomly changed multiple times within an episode, and without any retraining, the agents can still perform their tasks. The performance comparison for each environment is reported in the table below.
Table caption: Reshuffling observations in the tasks considered in this work
Within each episode, the authors reshuffle the observation order every t steps. The CartPole task has high variance, so it was tested 1,000 times; for the other tasks, the mean and standard deviation over 100 tests are reported. Except for Atari Pong, all environments have a hard limit of 1,000 steps per episode. In Atari Pong, although there is no maximum episode length, each episode was observed to last about 2,500 steps.
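The sketch below illustrates this evaluation protocol with a dummy environment and policy; the environment, the policy, and the reshuffling interval t = 100 are placeholders, since the text does not specify t.

```python
import numpy as np

rng = np.random.default_rng(0)

class DummyEnv:
    """Stand-in environment with a 5-dimensional observation (illustration only)."""
    def reset(self):
        return rng.normal(size=5)
    def step(self, action):
        return rng.normal(size=5), 0.0, False      # obs, reward, done

def evaluate_with_reshuffling(env, policy, episode_len=1000, t=100):
    """Run one episode, drawing a fresh random permutation of the observation
    channels every `t` steps."""
    obs = env.reset()
    perm = rng.permutation(obs.size)
    total_reward = 0.0
    for step in range(episode_len):
        if step % t == 0:
            perm = rng.permutation(obs.size)       # re-shuffle the channel order
        action = policy(obs[perm])                 # the policy only sees shuffled inputs
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

print(evaluate_with_reshuffling(DummyEnv(), policy=lambda o: o.mean()))
```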
Shuffling the agent's observations, even when those observations are incomplete, forces it to interpret the meaning of each local sensory input and its relationship to the global whole. This has practical value in many current applications: for example, when applied to robots, it can avoid errors caused by cross-wiring or complex, changing input-output mappings. In a setup similar to the CartPole experiment with additional noise channels, a system receiving thousands of noisy input channels can identify the small subset of channels that carry relevant information.
One limitation is that, in visual environments, the choice of patch size affects both performance and computational complexity. The authors found that a 6x6-pixel patch size works very well for the tasks considered, a 4x4-pixel patch size works to some extent, but single-pixel observations do not work. A small patch size also produces a large attention matrix, which can be too computationally expensive unless approximations are used.
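For reference, a small sketch of how a frame could be cut into non-overlapping 6x6 patches is given below; the frame size and reshaping details are illustrative rather than taken from the paper's code. The number of patches produced here is also the size of the set the attention layer must attend over, which is why smaller patches raise the computational cost.

```python
import numpy as np

def to_patches(frame, patch=6):
    """Split an HxWxC frame into non-overlapping patch x patch blocks; each
    block becomes one sensory input channel."""
    h, w, c = frame.shape
    h, w = h - h % patch, w - w % patch            # crop so the frame divides evenly
    blocks = frame[:h, :w].reshape(h // patch, patch, w // patch, patch, c)
    return blocks.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

frame = np.zeros((96, 96, 3))                      # e.g. a 96x96 RGB game frame
print(to_patches(frame, patch=6).shape)            # (256, 108): 16x16 patches of 6x6x3
```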
Another limitation is that the permutation-invariance property applies only to the inputs, not to the outputs. While the order of the observations can be reshuffled, the order of the actions cannot. For permutation-invariant outputs to work, each output unit would need feedback from the environment, including reward information, in order to learn its relationship to the environment.
An interesting direction for future research is to give the action layer the same property, modeling each motor neuron as a module connected through attention. With the authors' method, it may be possible to train an agent with an arbitrary number of components, or to use a single policy, given a reward signal as feedback, to control robots with different morphologies. In addition, the method in this work takes the previous action as the feedback signal, but the feedback signal is not limited to actions. The authors say they look forward to future work that incorporates signals such as environmental rewards, producing agents that can adapt not only to changes in the observed environment but also to changes in themselves, with fixed components replaced by trained meta-learning agents.