Recently I have proposed two novel learning rules, namely MAP Propagation (Chung, 2021) and Weight Maximization (Chung, 2022), on training coagent networks to solve reinforcement learning (RL) tasks efficiently. My work is done under the guidance of Andrew Barto, who inspired this research and provided valuable insight and comments. In this article, I will explain the details of these two learning methods in an easy-to-understand manner, assuming that you are familiar with the basics of RL.

In this article, we will only focus on policy gradient methods instead of action-value methods such as Q-Learning. We are interested in learning the parameters \(W\) in the policy function \(\pi(s, a; W) =\text{Pr}(A_t=a|S_t=s; W)\), which gives the conditional probability of the agent selecting the action \(a\) when in state \(s\).

To learn the policy function to maximize the return of episode \(G_t\) (that is, the sum of discounted rewards from time \(t\)), we can use the REINFORCE learning rule (Williams, 1992): \[ W \leftarrow W + \alpha G_t\nabla_{W}\log \text{Pr}(A_t|S_t; W) \] where we denote the state at time \(t\) as \(S_t\) and the action chosen at time \(t\) as \(A_t\).

The learning rule is very intuitive. We are doing gradient ascent on the probability of selecting the chosen action scaled by the return, which causes the parameters to move most in the directions that favor action that yield the highest return. It can be shown that this learning rule follows the gradient of the return in expectation, so it is a valid learning rule.

Now we consider a policy function parameterized by a multi-layer feedforward neural network, as shown in the left of Figure 1. Conventionally, the hidden units are deterministic, and only the output unit is stochastic, which is necessary for the agent to explore. We treat the entire network as a single agent and train the hidden units using backpropagation, as shown in the REINFORCE learning rule. We refer to this learning method as backpropagation in this article.

Then we consider an alternative view, where each hidden unit is stochastic, and ** we treat each unit in the network as an agent**. Then for each hidden unit, we can formulate a new MDP by treating the layer immediately below as state, the action selected as the output of the hidden unit, and all other layers as part of the black-box environment, as shown in the right of Figure 1. Such a network of agents is called a coagent network (Thomas, 2011).

Based on the alternative view of MDP for each unit, we can simply apply REINFORCE learning rule to each unit: \[ W^l \leftarrow W^l + \alpha G_t\nabla_{W^l }\log \text{Pr}(H^l_t|H^{l-1}_t; W^l) \] where we denote the weight on layer \(l\) as \(W^l\), the values of hidden units on layer \(l\) at time \(t\) as \(H^l_t\) (for simplicity, we let \(H^0_t = S_t\) and \(H^L_t = A_t\) where \(L\) is the number of layers in the network).

It can be shown that the learning rule is still following gradient ascent on return in expectation. We refer to this learning method as global REINFORCE in this article.

The idea of coagent networks goes as far back as the 1980s. Klopf (1982) conjectured that individual neurons could be trained with response-contingent reinforcement like an animal, and Barto (1985) proposed a network of \(A_{r-p}\) units to solve simple tasks by treating each unit in a network as an agent. My research is done under Professor Barto's guidance, who is still interested in the same idea after more than thirty years since he first proposed it! Although having a long history, there has not been much research interest in this topic due to the difficulty in training a coagent network.

In his hedonistic neuron hypothesis, Klopf (1972, 1982) conjectured that individual neurons seek to maximize the difference between synaptic input treated as rewarding and synaptic input treated as punishing by adjusting the efficacies of their synapses on the basis of rewarding or punishing consequences of their own action potentials. In other words, individual neurons can be trained with response-contingent reinforcement like an animal can be trained in an instrumental conditioning task.

The coagent network concept is also closely related to hierarchical RL since units in lower layers can be regarded as high-level agents, and units in upper layers can be regarded as low-level agents. Collectively intelligent behavior arising from a population of self-coordinated animals can also be observed in nature, which is called swarm intelligence.

Why should we be interested in coagent networks? There are three major motivations for using a coagent network:

- Biological plausibility: the learning rule is entirely local except the global reward, and backpropagation, which is generally regarded as biologically implausible, is not required. The global reinforcement signal has been hypothesized to be represented by dopamine in biological systems. For more discussion on the biological plausibility of REINFORCE, see Chapter 15 of (Sutton & Barto, 2017).
- Better exploration: allows stochasticity at higher levels than just the outputs of the network, which corresponds to exploration at a higher level: moving a different chess piece as opposed to twitching muscles differently.
- Robustness to perturbed inputs: Be more robust to perturbed inputs and avoid over-fitting. The stochasticity at each level of the network forces the network to be more robust to changes in the inputs.

However, training a coagent network by global REINFORCE is extremely slow, and the learning rule has a high variance. Each coagent has to correlate its action with the reward signal from the environment, and the reward signal is simultaneously affected by all coagents in the network. Consider an example where the first layer of the network decides which chess piece to move, and the second layer of the network decides how to twitch the muscle. Suppose that the first layer selected the right chess piece to move, but due to the second layer's exploration, the agent twitched the wrong muscle and moved the wrong chess piece, which resulted in a negative reward. In this case, all coagents in the network are blamed, even including the first layer that selected the right chess piece! This problem is called structural credit assignment: how can we assign credit to different coagents in the entire network? The inefficient structural credit assignment in global REINFORCE makes it impractical to use in practice for all but the smallest networks.

P.S. Part of the ideas in this section comes from Professor Philip Thomas.

Under this context, we research possible methods to reduce the high variance in global REINFORCE. Among the various methods that we have investigated, two approaches stand out: Weight Maximization and MAP Propagation.

The idea of Weight Maximization is simple. We replace the reward to each hidden unit by the change in the norm of its output weights, such that each hidden unit in the network is trying to maximize the norm of output weights instead of the global returns. The intuition is that the norm of output weights of a unit roughly represents the unit's 'utility'. For example, if the unit is useful in guiding action, then the output units will learn a large weight associated with it. If the unit is outputting random noise, then the output units will learn a zero weight associated with it. As each unit is maximizing the norm of output weights, they are also maximizing their 'utility' in the network.

The learning rule of Weight Maximization for hidden units on layer \(l\) is given by: \[ W^l \leftarrow W^l + \alpha R^l_t\nabla_{W^l }\log \text{Pr}(H^l_t|H^{l-1}_t; W^l) \] where \(R^l_t\) is the change in the norm of output weight, which is a vector (since there are multiple coagents on layer \(l \)). The \(i^{th}\) entry of \(R^l_t\) is then given by the equation: \[ R^l_{i, t} = \sum_{j} \Vert W^{l+1}_{ij} + \Delta W^{l+1}_{ij} \Vert^2- \Vert W^{l+1}_{ij} \Vert^2 \] where we denote weight update to layer \(l+1\) as \(\Delta W^{l+1}\) and \(W^{l+1}_{ij}\) as the weight connecting from unit \(i\) on layer \(l\) to unit \(j\) on layer \(l+1\).

In our paper, we show that this learning rule approximately follows the gradient of reward in expectation. Also, we found that combining Weight Maximization with weight regularization gives a much better experimental result.

Though Weight Maximization is simple, it suffers from the same problem of backpropagation – layer-wise synchronization is required. We need to wait for all upper layers to finish updating before knowing the change in the norm of output weights. Therefore, we also propose to use eligibility traces to remove such requirements in the paper. You can read more about the details in the paper.

We also propose another learning method, which is based on a simple idea of replacing the output of all hidden units by MAP (maximum a posteriori probability) estimation of it conditioned on state and action before applying REINFORCE to all units. We call this learning method MAP Propagation.

Suppose you are playing a computer game using a controller with only two buttons (button A and B). When you press one of these two buttons, there is a 5% probability that the opposite button will be transmitted to the computer instead. Now consider the case that you pressed button A, and the computer showed that you pressed button B instead, followed by a large reward. Should you press button A or button B when you encounter the same state in the future? Surely, you should press button B to collect the large reward! It is because you are more likely to have pressed button B, although you pressed button A in reality.

This gives the following idea: the true action selected by the hidden unit does not matter, and only the most probable action selected by the hidden unit conditioned on the final action (the action passed to the environment) matters. This is also related to the idea of hindsight action in hierarchical RL (Levy et al., 2017). So we can replace the output of all hidden units by their most probable value conditioned on state and action, which is the MAP estimation, and apply the REINFORCE learning rule afterward. Therefore, the learning rule is given by: \[ W^l \leftarrow W^l + \alpha G_t\nabla_{W^l }\log \text{Pr}(\hat{H}{}^l_t|\hat{H}{}^{l-1}_t; W^l) \] where \( \hat{H^l_t} \) is the MAP estimation of \(H^l_t\) given the state and the selected action: \[ \hat{H}_t = \mathop{\mathrm{argmax}}_{h_t} \text{Pr}(H_t=h_t|S_t, A_t) \]

However, since \(\hat{H}_t\) is intractable to compute analytically, we can approximate it by running gradient ascent on \( \log \text{Pr}(H_{t}|S_t, A_t)\) as a function of \(H_t\) for fixed \(S_t\) and \(A_t\), such that \(H_t\) approaches \(\hat{H_t}\). That is, we apply the following update to \(H_{t}\) for \(N\) steps first: \[H_{t} \leftarrow H_{t} + \alpha \nabla_{H_{t}} \log \text{Pr}(H_{t}|S_t, A_t)\] This update is also local (only depends on the value of the layer below and above), since it is equivalent to:

\[H^l_{t} \leftarrow H^l_{t} + \alpha(\nabla_{H^l_t} \log \text{Pr}(H^{l+1}_{t}|H^{l}_{t}) + \nabla_{H^l_{t}} \log \text{Pr}(H^l_{t}|H^{l-1}_{t}))\]

After running the above update for \(N\) times, \(H_t\) should be close to the MAP estimation \(\hat{H}_t\), and so we can apply global REINFORCE.

The following table summarizes the four learning methods of hidden units discussed in this article (note that the learning rules of output units are all the same, given by REINFORCE):

Learning Method | Learning Rule of Hidden Units |
---|---|

Backpropagation | \( W^l \leftarrow W^l + \alpha G_t\nabla_{W^l}\log \text{Pr}(A_t|S_t; W^l) \) |

Global REINFORCE | \( W^l \leftarrow W^l + \alpha G_t\nabla_{W^l }\log \text{Pr}(H^l_t|H^{l-1}_t; W^l) \) |

Weight Maximization | \( W^l \leftarrow W^l + \alpha R^l_t\nabla_{W^l }\log \text{Pr}(H^l_t|H^{l-1}_t; W^l) \) \( R^l_{i, t} = \sum_{j} \Vert W^{l+1}_{ij} + \Delta W^{l+1}_{ij} \Vert^2- \Vert W^{l+1}_{ij} \Vert^2 \) |

MAP Propagation | \( W^l \leftarrow W^l + \alpha G_t\nabla_{W^l }\log \text{Pr}(\hat{H}{}^l_t|\hat{H}{}^{l-1}_t; W^l) \) \( \hat{H}_t \approx \mathop{\mathrm{argmax}}_{h_t} \text{Pr}(H_t=h_t|S_t, A_t) \) |

The MAP Propagation learning rule can further be generalized to actor-critic networks with eligibility traces. The details of the generalization can be found in the paper. No backpropagation is required in the algorithm, and all learning rules are entirely local except for two global signals. Different from Weight Maximization, no layer-wise synchronization is necessary for MAP Propagation, which means the update to each layer can be computed in parallel.

Our experiments show that the speed of Weight Maximization can roughly match the speed of STE backpropagation (straight-through backpropagation, which allows backpropagating through stochastic and discrete units) when applied to a network of Bernoulli logistic units in four standard RL tasks. We observe that Weight Maximization learns much faster than global REINFORCE, which indicates that Weight Maximization can effectively facilitate structural credit assignment. However, compared to a network of continuous-valued units trained by backprop, Weight Maximization performed slightly worse (except for the LunarLander task). This difference is likely due to the limitation of discrete-valued units - units can only communicate with binary values instead of real values.

For networks of continuous-valued units, we also tested the performance of MAP Propagation and backpropagation on four standard RL tasks. Our experiment shows that MAP propagation can solve these tasks at a speed similar to that of backpropagation, and also agents trained by MAP propagation can avoid local optima better than backpropagation, as shown in the MountainCar task.** This is the first work demonstrating that coagent networks can be trained at a speed similar to that of backpropagation. **

This article introduces the concept of coagent networks and discusses the significance and issues of coagent networks. We describe two learning methods, namely Weight Maximization and MAP Propagation, that can effectively mitigate the high variance of training coagent networks. We show that a coagent network can be trained at a speed similar to that of backpropagation using MAP Propagation. Our research provides a new class of RL algorithms that do not require backpropagation and opens the prospect of the broader application of coagent networks in deep RL.

P.S. If you are interested in discussing the work with me, feel free to email me! It is an exciting research direction, and I think there can be a lot of follow-up works on Weight Maximization and MAP Propagation.

- Barto, A. G. (1985). Learning by statistical cooperation of self-interested neuron-like computing elements. Human Neurobiology, 4(4):229–256.
- Chung, S. (2021). MAP Propagation Algorithm: Faster Learning with a Team of Reinforcement Learning Agents. In
*Advances in Neural Information Processing Systems*, 2021. - Chung, S. (2022). Learning by Competition of Self-Interested Reinforcement Learning Agents. In
*Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence*, 2022. - Klopf, A. H. (1972). Brain function and adaptive systems—A heterostatic theory. Technical Report AFCRL-72-0164, Air Force Cambridge Research Laboratories, Bedford, MA.
- Klopf, A. H. (1982).
*The hedonistic neuron: a theory of memory, learning, and intelligence.*Hemisphere, Washington, DC. - Levy, A., Konidaris, G., Platt, R., and Saenko, K. (2017). Learning multi-level hierarchies with hindsight.
*arXiv preprint arXiv:1712.00948.* - Sutton, R. S. and Barto, A. G. (2018).
*Reinforcement learning: An introduction*. MIT press. - Thomas, P. S. (2011). Policy gradient coagent networks.
*In Advances in Neural Information Processing Systems*, pages 1944–1952. - Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning.
*Machine learning*, 8(3-4):229–256.