Stephen Chung

Trumpington St

Cambridge CB2 1PZ

United Kingdom

My name is Stephen Chung. I am a PhD Candidate at the University of Cambridge, supervised by David Krueger. My primary research interests include AI for science, reinforcement learning (RL), biologically-inspired machine learning, and AI alignment.

In 2025, I co-founded a start-up, DualverseAI. Our goal is to advance AI-driven scientific discovery by building an open-world multi-agent environment for AIs. I believe that scientific discovery requires freedom that enables long-term exploration and the accumulation of insights, rather than rigid pipelines or instructions. We developed The Station, in which AI agents achieve state-of-the-art results on benchmarks spanning mathematics, computational biology, and machine learning.

My PhD research focuses on building AI that can reason and plan like humans. When faced with an unfamiliar situation, we consider several possible actions and simulate the corresponding futures (e.g., “what will happen if I hit the tennis ball from the left versus the right?”), allowing us to choose the best action. In familiar situations like driving home, however, we rely on habit without explicit deliberation. This contrast suggests that planning should be a flexible and learnable process rather than a fixed one. I proposed the Thinker method, which allows an RL agent to learn to plan flexibly.

Prior to Cambridge, I earned a master’s degree from the University of Massachusetts Amherst in 2021. During my studies, I was supervised by Andrew Barto and researched efficient methods to train deep neural networks without backpropagation, based on coagent networks.

Outside of research, I love reading Western and Chinese philosophy, particularly works by Zhuangzi and Nietzsche. I enjoy contemplating philosophical questions, playing tennis, and hiking.

Selected Publications

The Station: An Open-World Environment for AI-Driven Discovery

Stephen Chung and Wenyu Du

arXiv preprint arXiv:2511.06309, 2025

We introduce the Station, an open-world multi-agent environment that models a miniature scientific ecosystem. Leveraging their extended context windows, agents in the Station can engage in long scientific journeys that include reading papers from peers, formulating hypotheses, submitting code, performing analyses, and publishing results. Importantly, there is no centralized system coordinating their activities; agents are free to choose their own actions and develop their own narratives within the Station. Experiments demonstrate that AI agents in the Station achieve new state-of-the-art performance on a wide range of benchmarks, spanning from mathematics to computational biology to machine learning, notably surpassing AlphaEvolve in circle packing. A rich tapestry of narratives emerges as agents pursue independent research, interact with peers, and build upon a cumulative history. From these emergent narratives, novel methods arise organically, such as a new density-adaptive algorithm for scRNA-seq batch integration. The Station marks a first step towards autonomous scientific discovery driven by emergent behavior in an open-world environment, representing a new paradigm that moves beyond rigid optimization.
Thinker: Learning to Think Fast and Slow

Stephen Chung, Wenyu Du, and Jie Fu

In Advances in Neural Information Processing Systems, 2025

Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. With a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1. However, this search behavior is often imprecise and lacks confidence, resulting in long, redundant responses and highlighting deficiencies in intuition and verification. Inspired by the Dual Process Theory in psychology, we introduce a simple modification to the QA task that includes four stages: Fast Thinking, where the LLM must answer within a strict token budget; Verification, where the model evaluates its initial response; Slow Thinking, where it refines the initial response with more deliberation; and Summarization, where it distills the refinement from the previous stage into precise steps. Our proposed task improves average accuracy from 25.6% to 27.3% for Qwen2.5-1.5B, and from 45.9% to 51.0% for DeepSeek-R1-Qwen-1.5B. Notably, for Qwen2.5-1.5B, the Fast Thinking mode alone achieves 25.2% accuracy using fewer than 1000 tokens, demonstrating substantial inference efficiency gains. These findings suggest that intuition and deliberative reasoning are distinct, complementary systems benefiting from targeted training. Additionally, we have open-sourced both the trained models and the source code.
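
The four stages described in this abstract can be sketched as a simple prompting loop. This is a minimal illustration rather than the authors' training setup: the `generate` callable, the prompt wording, the token budgets, and the choice to run every stage unconditionally are all assumptions made for the example.

```python
from typing import Callable, Dict

# Hypothetical model interface: generate(prompt, max_tokens) -> completion text.
Generate = Callable[[str, int], str]


def four_stage_answer(
    question: str,
    generate: Generate,
    fast_budget: int = 1000,  # assumed strict budget for Fast Thinking
    slow_budget: int = 6000,  # assumed looser budget for the later stages
) -> Dict[str, str]:
    """Run Fast Thinking, Verification, Slow Thinking, and Summarization in order."""
    # Stage 1 (Fast Thinking): answer under a strict token budget.
    fast = generate(f"Question: {question}\nAnswer briefly.", fast_budget)

    # Stage 2 (Verification): the model evaluates its own initial response.
    verification = generate(
        f"Question: {question}\nInitial answer: {fast}\n"
        "Is the initial answer correct? Explain briefly.",
        slow_budget,
    )

    # Stage 3 (Slow Thinking): refine the initial response with more deliberation.
    slow = generate(
        f"Question: {question}\nInitial answer: {fast}\n"
        f"Verification: {verification}\nThink carefully and give a refined answer.",
        slow_budget,
    )

    # Stage 4 (Summarization): distill the refinement into precise steps.
    summary = generate(
        f"Question: {question}\nRefined reasoning: {slow}\n"
        "Summarize the reasoning as concise steps ending with the final answer.",
        slow_budget,
    )

    return {"fast": fast, "verification": verification, "slow": slow, "summary": summary}
```

Any text-generation backend can be plugged in as `generate`. Only the first call is constrained by `fast_budget`, mirroring the strict budget that the abstract attaches to Fast Thinking.
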
Thinker: Learning to Plan and Act

Stephen Chung, Ivan Anokhin, and David Krueger

In Advances in Neural Information Processing Systems, 2023

We propose the Thinker algorithm, a novel approach that enables reinforcement learning agents to autonomously interact with and utilize a learned world model. The Thinker algorithm wraps the environment with a world model and introduces new actions designed for interacting with the world model. These model-interaction actions enable agents to perform planning by proposing alternative plans to the world model before selecting a final action to execute in the environment. This approach eliminates the need for handcrafted planning algorithms by enabling the agent to learn how to plan autonomously and allows for easy interpretation of the agent’s plan with visualization. We demonstrate the algorithm’s effectiveness through experimental results in the game of Sokoban and the Atari 2600 benchmark, where the Thinker algorithm achieves state-of-the-art performance and competitive results, respectively. Visualizations of agents trained with the Thinker algorithm demonstrate that they have learned to plan effectively with the world model to select better actions. Thinker is the first work showing that an RL agent can learn to plan with a learned world model in complex environments.
Learning by competition of self-interested reinforcement learning agents

Stephen Chung

In Proceedings of the AAAI Conference on Artificial Intelligence, 2022

An artificial neural network can be trained by uniformly broadcasting a reward signal to units that implement a REINFORCE learning rule. Though this presents a biologically plausible alternative to backpropagation in training a network, the high variance associated with it renders it impractical to train deep networks. The high variance arises from the inefficient structural credit assignment since a single reward signal is used to evaluate the collective action of all units. To facilitate structural credit assignment, we propose replacing the reward signal to hidden units with the change in the L2 norm of the unit’s outgoing weight. As such, each hidden unit in the network is trying to maximize the norm of its outgoing weight instead of the global reward, and thus we call this learning method Weight Maximization. We prove that Weight Maximization is approximately following the gradient of rewards in expectation. In contrast to backpropagation, Weight Maximization can be used to train both continuous-valued and discrete-valued units. Moreover, Weight Maximization solves several major issues of backpropagation relating to biological plausibility. Our experiments show that a network trained with Weight Maximization can learn significantly faster than REINFORCE and slightly slower than backpropagation. Weight Maximization illustrates an example of cooperative behavior automatically arising from a population of self-interested agents in a competitive game without any central coordination.
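
The per-unit reward described in this abstract, the change in the L2 norm of a hidden unit's outgoing weights, can be computed in a few lines. The sketch below is illustrative only: the function name and the weight layout (one row per hidden unit, one column per downstream unit) are assumptions, and it shows only the reward signal, not the full Weight Maximization learning rule.

```python
import numpy as np


def weight_change_reward(w_before, w_after):
    """Return, for each hidden unit, the change in the L2 norm of its outgoing weights.

    Assumed layout: rows index hidden units, columns index downstream units.
    """
    w_before = np.asarray(w_before, dtype=float)
    w_after = np.asarray(w_after, dtype=float)
    # Norm of each row after the downstream update minus the norm before it.
    return np.linalg.norm(w_after, axis=1) - np.linalg.norm(w_before, axis=1)


# Example: the first unit's outgoing norm grows from 5 to 10, the second from 0 to 1.
rewards = weight_change_reward([[3.0, 4.0], [0.0, 0.0]], [[6.0, 8.0], [0.0, 1.0]])
```

Each hidden unit would then treat its own entry of `rewards` as its reward in place of the global reward signal.
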
MAP Propagation Algorithm: Faster Learning with a Team of Reinforcement Learning Agents

Stephen Chung

In Advances in Neural Information Processing Systems, 2021

Nearly all state-of-the-art deep learning algorithms rely on error backpropagation, which is generally regarded as biologically implausible. An alternative way of training an artificial neural network is to treat each unit in the network as a reinforcement learning agent, so that the network is considered a team of agents. As such, all units can be trained by REINFORCE, a local learning rule modulated by a global signal that is more consistent with biologically observed forms of synaptic plasticity. Although this learning rule follows the gradient of return in expectation, it suffers from high variance and therefore learns slowly, rendering it impractical to train deep networks. We therefore propose a novel algorithm called MAP propagation to reduce this variance significantly while retaining the local property of the learning rule. Experiments demonstrated that MAP propagation could solve common reinforcement learning tasks at a similar speed to backpropagation when applied to an actor-critic network. Our work thus allows for the broader application of teams of agents in deep reinforcement learning.
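
The REINFORCE baseline that this abstract starts from can be illustrated for a single unit. The sketch below assumes a Bernoulli-logistic unit (a stochastic binary unit whose firing probability is a sigmoid of its weighted input); the function name, the learning rate, and the omission of a reward baseline are choices made for this example. It does not implement MAP propagation itself.

```python
import numpy as np


def reinforce_update(w, x, action, reward, lr=0.1):
    """One REINFORCE step for a single Bernoulli-logistic unit.

    w: weight vector, x: input vector, action: sampled output (0 or 1),
    reward: global scalar reward broadcast to every unit.
    """
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(w @ x)))  # probability that the unit fires
    # (action - p) * x is the gradient of log P(action | x) with respect to w.
    return w + lr * reward * (action - p) * x


# Example: zero weights give p = 0.5, so firing with reward 2 moves w[0] by 0.1.
new_w = reinforce_update([0.0, 0.0], [1.0, 0.0], action=1, reward=2.0)
```

Every unit applies the same local rule using the same global reward, which is why the update is local yet noisy: a single scalar has to account for the actions of all units at once.
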