New research in Nature: a distributional reinforcement learning mechanism in the brain

Compiled by | Lei Feng Net AI Technology Review

Lei Feng Net (WeChat public account: Lei Feng Net) Editor's Note: Artificial intelligence and neuroscience/brain science have long been deeply intertwined. From its very beginnings, AI research has drawn heavily on neuroscience, including artificial neural networks, reinforcement learning and many other algorithms. More recently, the increasingly popular field of brain-like computing has explicitly put forward the idea of being "brain-inspired". We often hear that AI research is inspired by neuroscience and brain science; can neuroscience and brain science, in turn, be inspired by AI research?

DeepMind's recent paper in Nature is exactly such a case. Inspired by distributional reinforcement learning, the team studied the physiology of dopamine cells in mice and found that the brain, too, uses a form of "distributional reinforcement learning". On one hand, this research pushes neuroscience forward; on the other, it is evidence that AI research is on the right path.

Learning and motivation are driven by internal and external rewards. Many of our everyday behaviors are guided by predicting, or anticipating, whether a given action will lead to a positive (that is, beneficial) outcome.

In his most famous experiment, Pavlov trained dogs to expect food after a bell rang: the dogs began to salivate as soon as they heard the sound, before any food arrived, showing that they had learned to predict the reward. In the original experiments, Pavlov estimated their expectations by measuring how much saliva they produced. In recent decades, however, scientists have begun to decipher how the brain learns these expectations internally.

In parallel with this work in neuroscience, computer scientists have been steadily developing reinforcement learning algorithms for artificial systems. These algorithms allow AI systems to learn complex strategies without external instruction, guided instead by reward predictions.

DeepMind's new paper in Nature is inspired by recent work in computer science, specifically a major improvement to reinforcement learning algorithms. It provides a deep and remarkably simple explanation for several previously unexplained features of reward learning in the brain, and it opens up new ways to study the brain's dopamine system. It can fairly be described as a model case of artificial intelligence research giving back to neuroscience and brain science.

The prediction chain: temporal difference learning

Reinforcement learning is one of the oldest and most powerful ideas to emerge from the meeting of artificial intelligence and neuroscience, and it has been around since the late 1980s. Computer scientists at the time were trying to design algorithms that feed rewards and punishments back into a machine's learning process as signals, with the goal of getting the machine to perform complex behaviors on its own. Rewarding a behavior reinforces it, but solving a specific problem requires understanding how the machine's current behavior leads to future returns, and predicting the total future return of a given action usually requires looking many steps ahead.

The temporal difference (TD) algorithm provided the breakthrough for the reward prediction problem. TD uses a mathematical trick to replace complex reasoning about the future with a very simple learning procedure that produces the same result. Put simply, instead of computing the total future return, the TD algorithm predicts only the immediate reward plus its own prediction of reward at the next step. Then, when new information arrives at the next moment, the new prediction is compared with what was expected.

If the two differ, the algorithm computes the difference between them and uses this "temporal difference" to nudge the old prediction toward the new one. By repeatedly matching expectations against reality, the whole chain of predictions gradually becomes more accurate.
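To make this update concrete, here is a minimal sketch of tabular TD(0) learning on a toy "bell then food" chain. The states, rewards and parameters are invented for illustration and are not taken from the paper.

```python
# A minimal sketch of tabular TD(0) on a made-up "bell -> food" chain.
states = ["bell", "food", "end"]
next_state = {"bell": "food", "food": "end"}
reward = {"bell": 0.0, "food": 1.0}   # reward received on leaving each state

V = {s: 0.0 for s in states}          # value estimates: predicted future reward
alpha, gamma = 0.1, 0.9               # learning rate and discount factor

for _ in range(200):
    s = "bell"
    while s != "end":
        s_next = next_state[s]
        r = reward[s]
        # Temporal difference error: (reward seen now + next prediction) - old prediction
        delta = r + gamma * V[s_next] - V[s]
        V[s] += alpha * delta          # nudge the old prediction toward the new one
        s = s_next

print({s: round(v, 2) for s, v in V.items()})  # "bell" comes to predict the upcoming reward
```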

Around the late 1980s and early 1990s, neuroscientists studying dopamine neurons found that their firing is related to reward in a way that depends on sensory input, and that this relationship changes as the subject of study (an animal, for example) becomes more experienced at a task.

In the mid-1990s, a group of scientists fluent in both neuroscience and artificial intelligence noticed that the responses of some dopamine neurons signal errors in reward prediction: these neurons fire when the animal receives more, or less, reward than it expected during training. These scientists proposed that the brain uses a TD algorithm in which dopamine feedback is the prediction error that drives learning. Since then, this reward prediction error theory of dopamine has been validated in thousands of experiments and has become one of the most successful quantitative theories in neuroscience.

Distributional reinforcement learning

Computer scientists did not stop there. Since 2013, more and more researchers have turned to deep reinforcement learning, which uses deep neural networks to learn representations within reinforcement learning and can solve very complex problems effectively.

Figure 1: The probability distribution of the reward that may be received in the future; red indicates a positive outcome and green a negative one.

Distributional reinforcement learning is one representative of this progress, and it makes reinforcement learning considerably more effective. In many situations, especially in the real world, the future reward produced by a particular action is random. As shown in the figure above, the character in the figure does not know whether it will clear the gap or fall into it, so the probability distribution of the predicted reward has two bumps: one for falling, one for crossing successfully. The traditional TD approach predicts the average future reward, which clearly cannot capture the two peaks of this distribution; distributional reinforcement learning, by contrast, can learn to predict the full range of possibilities.

The simplest distributional reinforcement learning algorithm, known as distributional TD, is closely related to standard TD. The difference is that standard TD learns a single prediction, the expected value of future reward, whereas distributional TD learns a set of different predictions, each trained with the standard TD procedure. The key ingredient is that each predictor applies a different transformation to its reward prediction error.
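To illustrate the idea, here is a minimal sketch of a distributional TD-style update for a single state with a stochastic reward, using an expectile-style asymmetric scaling of the prediction error. All numbers are invented; this is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

taus = np.linspace(0.05, 0.95, 10)   # asymmetry of each predictor: >0.5 optimistic, <0.5 pessimistic
values = np.zeros_like(taus)         # one reward prediction per predictor
lr = 0.02                            # learning rate

def sample_reward():
    # Bimodal outcome, like the gap-jumping example: fall (reward 0) or cross (reward 1).
    return 0.0 if rng.random() < 0.5 else 1.0

for _ in range(50_000):
    r = sample_reward()
    delta = r - values                           # reward prediction error for each predictor
    scale = np.where(delta > 0, taus, 1 - taus)  # each predictor transforms its RPE differently
    values += lr * scale * delta                 # optimistic ones drift up, pessimistic ones down

print(np.round(values, 2))  # the set of predictions spans the two-peaked reward distribution
```

Predictors with large taus end up near the "successful crossing" outcome, small taus near the "fall" outcome, so the collection of predictions covers the whole distribution rather than just its mean.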

Figure 2: (a) "Pessimistic" cells amplify negative reward prediction errors or discount positive ones, while "optimistic" cells amplify positive errors or discount negative ones; (b) the cumulative distribution of rewards; (c) the complete reward distribution.

As shown in panel (a) above, some predictors selectively "amplify" their reward prediction error (RPE) when it is positive. This makes them learn a more optimistic reward prediction, corresponding to the higher part of the reward distribution. Other predictors amplify their negative reward prediction errors and therefore learn more pessimistic predictions. Together, the pessimistic and optimistic predictors map out the complete reward distribution.

Beyond its simplicity, another advantage of distributional reinforcement learning is that it is very powerful when combined with deep neural networks. Over the past five years, algorithms built on the original deep reinforcement learning agent, DQN, have made great progress, and they are commonly evaluated on the Atari-57 benchmark, a suite of Atari 2600 games.

Figure 3: Classic deep reinforcement learning compared with distributional reinforcement learning: human-normalized scores on the Atari-57 benchmark.

Figure 3 compares several standard RL and distributional RL algorithms trained and evaluated under identical conditions on the same benchmark. The distributional reinforcement learning agents are shown in blue, and they achieve a clear improvement. Three of them (QR-DQN, IQN and FQF) are variants of the distributional TD algorithm discussed above. Why are distributional reinforcement learning algorithms so effective? Although this is still an active research topic, one view is that learning the distribution of rewards gives the neural network a stronger signal, shaping its representations in a way that is more robust to changes in the environment or the policy.

Because distributional TD is so powerful in artificial neural networks, a scientific question naturally arises: can distributional TD be found in the brain? This was the original motivation for the work reported in Nature. In the paper, DeepMind collaborated with the Uchida Lab at Harvard to analyze their recordings of mouse dopamine cells. The recordings document how the mice learned in a task in which they received rewards of unpredictable size (shown in the color map of Figure 4):

Figure 4: In this task, mice were given water rewards of randomly determined, variable volume, ranging from 0.1 µl to 20 µl (the reward size was set by a roll of the dice). (A) Simulated responses of dopamine cells to the seven different reward sizes under the classic TD model. (B) Under the distributional TD model, each row of dots corresponds to one dopamine cell and each color to a different reward size; the colored curves are spline interpolations of the data. A cell's "reversal point" (the reward size at which its reward prediction error, and hence its change in firing rate, crosses zero) is the expected reward to which that cell is "tuned": at that reward its firing rate is neither above nor below its baseline rate. (C) The responses of actual dopamine cells to the different reward sizes closely match the predictions of the distributional TD model.

The inset shows three example cells with different relative scaling of positive and negative reward prediction errors. The researchers then assessed whether the activity of dopamine neurons is more consistent with standard TD or with distributional TD. As described above, distributional TD relies on a set of distinct reward predictions.

The first question, therefore, was whether such genuinely diverse reward predictions can be found in the neural data. From previous work, the researchers knew that dopamine cells change their firing rate to signal a prediction error, that is, when an animal receives more or less reward than it expected.

When a cell receives exactly the reward it predicted, the prediction error is zero and its firing rate does not change. For each dopamine cell, the researchers determined the reward size that left its firing rate at its baseline, which they called the cell's "reversal point". They then asked whether these reversal points differ between cells.
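As a sketch of what finding a reversal point might look like in practice, one can take a cell's average change in firing rate at each reward size and interpolate where that change crosses zero. The response values below are made-up numbers, not the recorded data.

```python
import numpy as np

reward_sizes = np.array([0.1, 0.3, 1.2, 2.5, 5.0, 10.0, 20.0])    # microlitres
delta_rate   = np.array([-4.0, -3.1, -1.8, -0.2, 1.5, 3.9, 6.0])  # spikes/s relative to baseline

def reversal_point(sizes, responses):
    """Linearly interpolate the reward size at which the response crosses zero."""
    i = int(np.where(np.diff(np.sign(responses)) > 0)[0][0])  # first negative-to-positive step
    x0, x1, y0, y1 = sizes[i], sizes[i + 1], responses[i], responses[i + 1]
    return x0 - y0 * (x1 - x0) / (y1 - y0)

print(f"estimated reversal point ~ {reversal_point(reward_sizes, delta_rate):.1f} microlitres")
```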

In Figure 4c, the authors show clear differences between cells: some cells predicted very large rewards, while others predicted very small ones, and these differences exceed what would be expected from the random variability in the recordings. In distributional TD, such differences in reward prediction arise from the selective amplification of positive or negative reward prediction errors: amplifying positive errors makes the learned reward prediction more optimistic, while amplifying negative errors makes it more pessimistic.
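In principle, this relative amplification can be read off a single cell's responses by fitting separate slopes above and below its reversal point. The sketch below continues the made-up numbers from the previous example and is only an illustration of the idea, not the paper's analysis pipeline.

```python
import numpy as np

reward_sizes = np.array([0.1, 0.3, 1.2, 2.5, 5.0, 10.0, 20.0])    # microlitres
delta_rate   = np.array([-4.0, -3.1, -1.8, -0.2, 1.5, 3.9, 6.0])  # spikes/s relative to baseline
reversal = 2.8                                                     # from the previous sketch

above = reward_sizes > reversal
alpha_plus  = np.polyfit(reward_sizes[above],  delta_rate[above],  1)[0]   # slope for positive RPEs
alpha_minus = np.polyfit(reward_sizes[~above], delta_rate[~above], 1)[0]   # slope for negative RPEs
tau = alpha_plus / (alpha_plus + alpha_minus)   # >0.5 means the cell amplifies positive errors

print(f"asymmetry ~ {tau:.2f} (an 'optimistic' cell if clearly above 0.5)")
```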

The researchers next measured the relative amplification of positive versus negative prediction errors exhibited by different dopamine cells, and again found diversity across cells that cannot be explained by noise. Most importantly, the same cells that amplified positive reward prediction errors also had higher reversal points (Figure 4c, lower right corner); that is, they were evidently tuned to expect larger rewards. Finally, distributional TD theory predicts that, across cells, the different reversal points and different asymmetries should together encode the learned reward distribution. So the last question was whether the reward distribution can be decoded from the firing rates of dopamine cells.

Figure 5: As a population, dopamine cells encode the shape of the learned reward distribution: the reward distribution can be decoded from their firing rates. The grey shaded area is the true distribution of rewards encountered in the task. Each light blue trace shows one run of the decoding procedure, and the dark blue trace is the average of the light blue traces.

As Figure 5 shows, using only the firing rates of dopamine cells it was entirely possible to reconstruct a reward distribution (blue traces) that closely matches the actual distribution of rewards (grey area) in the task the mice performed. The reconstruction works by interpreting each cell's firing rate as the reward prediction error of a distributional TD model and inferring the distribution the model must have learned.
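To give a feel for what such a decoding could look like, here is a rough sketch under a simplified quantile-style reading, in which each cell's reversal point is treated as approximately the quantile of the learned distribution indexed by its asymmetry. The (tau, reversal point) pairs are invented, and the paper's actual decoding is a more careful optimization over the recorded responses.

```python
import numpy as np

taus      = np.array([0.10, 0.25, 0.40, 0.50, 0.60, 0.75, 0.90])   # per-cell asymmetries (hypothetical)
reversals = np.array([0.4, 1.0, 2.0, 3.0, 4.5, 7.0, 12.0])         # per-cell reversal points, microlitres

# Sorted (tau, reversal) pairs trace an approximate inverse CDF of the reward
# distribution; sampling from it gives a crude reconstruction of its shape.
u = np.random.default_rng(1).uniform(taus.min(), taus.max(), size=10_000)
samples = np.interp(u, taus, reversals)

hist, edges = np.histogram(samples, bins=12, density=True)
print(np.round(hist, 3))   # rough shape of the decoded reward distribution
```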

In summary, the researchers found that each dopamine neuron in the brain is tuned to a different degree of optimism or pessimism. If they were a choir, they would not all sing the same note; they would sing in harmony, each holding a consistent part like a bass or a soprano. In artificial reinforcement learning systems, this diversity of tuning creates a richer training signal and greatly speeds up learning in neural networks, and the researchers infer that the brain adopts the same kind of diversity. The existence of distributional reinforcement learning in the brain has very interesting implications for both AI and neuroscience.

First, the finding validates distributional reinforcement learning and makes us more confident that AI research is on the right track, because a distributional reinforcement learning algorithm is already at work in the entity we consider the most intelligent: the brain. Second, it raises new questions for neuroscience and offers a new perspective on understanding mental health and motivation.

What happens if a person's brain selectively "listens" to optimistic or pessimistic dopamine neurons? Could that lead to impulsivity, or to depression? The brain's advantage lies in its powerful representations, so how does distributional learning give rise to such powerful representations? Once an animal has learned a reward distribution, how does it use that representation downstream? How does the diversity of optimism across dopamine cells relate to other known forms of diversity in the brain? These questions remain to be explored. We hope more researchers will ask and answer questions like these, advancing neuroscience and in turn feeding back into AI research, forming a virtuous loop.
