Stephen Chung

Trumpington St

Cambridge CB2 1PZ

United Kingdom

My name is Stephen Chung. I am a PhD student at the University of Cambridge, supervised by David Krueger. My primary research interests include reinforcement learning (RL), biologically-inspired machine learning, and AI alignment.

My PhD focuses on building AI that can reason and plan like humans. When faced with an unfamiliar situation, we may think of several possible actions and simulate the corresponding futures (e.g., what will happen if I hit the tennis ball from the left or from the right?), allowing us to choose the action with the best outcome. In familiar situations like driving home, however, we may rely solely on our habits without overthinking. This distinction suggests that planning should be a flexible and learnable process instead of a fixed one. I am currently studying how to build AI that learns this planning process by interacting with the environment, and how such learnable self-interaction may yield more powerful cognitive capabilities such as reasoning, dreaming, and thinking. I also argue that explicitly teaching an AI to plan, instead of relying on it to learn to plan inside a large neural network in a black-box manner (as is done in current large language models), is safer because it gives us more control over the planning process. You can find details about this research here.

Before coming to Cambridge, I graduated from the University of Massachusetts Amherst with a master’s degree in 2021. During my master’s studies, I was supervised by Andrew Barto and studied methods, based on coagent networks, for efficiently training deep neural networks without backpropagation.

As for my other interests, I love reading Western and Chinese philosophy, such as Zhuangzi and Nietzsche. I enjoy thinking about the world and philosophical questions. I also like playing tennis and hiking!

Selected Publications

Thinker: Learning to Plan and Act

Stephen Chung, Ivan Anokhin, and David Krueger

In Advances in Neural Information Processing Systems, 2023

We propose the Thinker algorithm, a novel approach that enables reinforcement learning agents to autonomously interact with and utilize a learned world model. The Thinker algorithm wraps the environment with a world model and introduces new actions designed for interacting with the world model. These model-interaction actions enable agents to perform planning by proposing alternative plans to the world model before selecting a final action to execute in the environment. This approach eliminates the need for handcrafted planning algorithms by enabling the agent to learn how to plan autonomously and allows for easy interpretation of the agent’s plan with visualization. We demonstrate the algorithm’s effectiveness through experimental results in the game of Sokoban and the Atari 2600 benchmark, where the Thinker algorithm achieves state-of-the-art performance and competitive results, respectively. Visualizations of agents trained with the Thinker algorithm demonstrate that they have learned to plan effectively with the world model to select better actions. Thinker is the first work showing that an RL agent can learn to plan with a learned world model in complex environments.

Learning by Competition of Self-Interested Reinforcement Learning Agents

Stephen Chung

In Proceedings of the AAAI Conference on Artificial Intelligence, 2022

An artificial neural network can be trained by uniformly broadcasting a reward signal to units that implement a REINFORCE learning rule. Though this presents a biologically plausible alternative to backpropagation in training a network, the high variance associated with it renders it impractical to train deep networks. The high variance arises from the inefficient structural credit assignment since a single reward signal is used to evaluate the collective action of all units. To facilitate structural credit assignment, we propose replacing the reward signal to hidden units with the change in the L2 norm of the unit’s outgoing weight. As such, each hidden unit in the network is trying to maximize the norm of its outgoing weight instead of the global reward, and thus we call this learning method Weight Maximization. We prove that Weight Maximization is approximately following the gradient of rewards in expectation. In contrast to backpropagation, Weight Maximization can be used to train both continuous-valued and discrete-valued units. Moreover, Weight Maximization solves several major issues of backpropagation relating to biological plausibility. Our experiments show that a network trained with Weight Maximization can learn significantly faster than REINFORCE and slightly slower than backpropagation. Weight Maximization illustrates an example of cooperative behavior automatically arising from a population of self-interested agents in a competitive game without any central coordination.

MAP Propagation Algorithm: Faster Learning with a Team of Reinforcement Learning Agents

Stephen Chung

In Advances in Neural Information Processing Systems, 2021

Nearly all state-of-the-art deep learning algorithms rely on error backpropagation, which is generally regarded as biologically implausible. An alternative way of training an artificial neural network is through treating each unit in the network as a reinforcement learning agent, and thus the network is considered as a team of agents. As such, all units can be trained by REINFORCE, a local learning rule modulated by a global signal that is more consistent with biologically observed forms of synaptic plasticity. Although this learning rule follows the gradient of return in expectation, it suffers from high variance and thus the low speed of learning, rendering it impractical to train deep networks. We therefore propose a novel algorithm called MAP propagation to reduce this variance significantly while retaining the local property of the learning rule. Experiments demonstrated that MAP propagation could solve common reinforcement learning tasks at a similar speed to backpropagation when applied to an actor-critic network. Our work thus allows for the broader application of teams of agents in deep reinforcement learning.
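
For readers unfamiliar with the baseline mentioned in the abstract above, the following is a minimal sketch of the "team of agents" idea: each unit is a Bernoulli-logistic unit updated by REINFORCE, and every unit receives the same broadcast reward. This illustrates only the high-variance baseline, not MAP propagation itself. The function name `reinforce_step`, its parameters, and the learning rate are hypothetical choices made for illustration and are not taken from the paper's code.

```python
import math
import random


def reinforce_step(weights, x, reward, rng, lr=0.1):
    """One REINFORCE update for a layer of Bernoulli-logistic units.

    Illustrative assumption: `weights` holds one row of incoming weights
    per unit, and every unit is given the same scalar `reward`.
    """
    new_weights, actions = [], []
    for w in weights:
        # Firing probability of this unit given the shared input x.
        p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        # Each unit samples its own binary action.
        a = 1.0 if rng.random() < p else 0.0
        # For a Bernoulli-logistic unit, d log pi(a) / d w = (a - p) * x,
        # so the update needs only local quantities plus the global reward.
        new_weights.append(
            [wi + lr * reward * (a - p) * xi for wi, xi in zip(w, x)]
        )
        actions.append(a)
    return new_weights, actions


if __name__ == "__main__":
    rng = random.Random(0)
    layer = [[0.0, 0.0] for _ in range(3)]  # three units, two inputs each
    updated, acts = reinforce_step(layer, [1.0, -1.0], reward=1.0, rng=rng)
    print(acts)
    print(updated)
```

Because a single reward evaluates the joint action of all units, each unit's update is noisy with respect to its own contribution, which is the variance problem the abstract describes.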