Learning to plan and act by augmenting the environment

  1. Thinker: Learning to Plan and Act
    Stephen Chung, Ivan Anokhin, and David Krueger
    In Advances in Neural Information Processing Systems, 2023

Planning and Acting

Planning is a crucial component of intelligence, equipping us with the ability to navigate through unanticipated and complex scenarios. To delve into the mechanics of planning, consider the game Sokoban as an example:

Fig 1: Illustration of Sokoban.

In Sokoban, the objective is to push all four boxes to the target area outlined in red. You control a green character capable of moving in four directions: up, down, left, and right. By approaching a box, you can push it. Although it appears straightforward, Sokoban demands significant planning, particularly since boxes cannot be pulled. For instance, pushing a box into a corner renders it immovable, creating a deadlock.

Now, let’s try solving this Sokoban level. What strategy would you employ?

Likely, you formulated several plans, ceasing to think further once you figured out a solution and selected the optimal move. In this exercise, there are two different processes: planning and acting. Planning refers to the general process of interacting with an imaginary environment, where we may think about different plans and imagine the results. Acting, on the other hand, refers to the process of executing an action in the real environment.

In this context, planning and acting are interrelated, as both involve decision-making in the game. During planning, we choose an action in the imagined game, whereas in acting, we choose an action in the real game. Planning can thus be perceived as a special type of action within the brain. This type of action, called internalized action in neuroscience, is hypothesized to be the foundation of cognitive processes. Neuroscience research suggests that similar mechanisms underlie these different types of actions. Studies in animals, such as rats facing a T-junction in unfamiliar environments, also indicate this link between planning and acting.

Learning to Plan

Existing AI planning algorithms, like A* and Monte-Carlo Tree Search (MCTS) used in AlphaGo, differ fundamentally from animal-like planning. These algorithms, despite their sophistication, lack the flexibility inherent in natural planning processes. For example, we don’t need to formulate numerous plans for routine tasks like brushing teeth. The inherent inflexibility of these planning algorithms stems from the fact that they are handcrafted. As such, the goal of this project is to learn a planning algorithm instead of handcrafting it.

Given that planning and acting share mechanisms, it’s conceivable to apply the same learning algorithms to both processes. Reinforcement learning, primarily focused on teaching agents to act, might also teach them to plan. Therefore, we propose an augmented environment that consists of both the imaginary and real environments. By applying RL algorithms to this augmented environment, we aim for the RL agent to learn planning through imaginary actions in the imaginary environment. The goal is that these learned imaginary actions and results from the imaginary environment may help the agent select better real actions and achieve better performance in the real environment.

The Augmented Environment

To construct the augmented environment (where environment refers to a Markov decision process), we utilize both an imaginary and a real environment. The imaginary environment is a learned or approximated version of the real one and is commonly known as a world model or simply model in RL. Our paper also explores how to learn a sample-efficient world model that allows easy visualization, but this is not the primary focus of the study. To integrate both environments, we introduce a predetermined number of imaginary steps (19 in most of our experiments) before each real step.

Imaginary Steps and Real Steps

During an imaginary step, the agent selects an imaginary action, and the world model predicts the subsequent state (and reward) based on the last imaginary state. The initial imaginary state is set to the current real state, ensuring that planning starts from the real state. Additionally, the agent can choose a special reset action to revert the imaginary state back to the current real state. This reset mechanism allows the agent to consider multiple plans, rather than being confined to a single, extended plan. For instance, if the initial plan proves ineffective, the agent can reset and propose a new plan from the beginning.

After a set number of imaginary steps, there is a single real step. In this step, the agent selects and executes a real action in the real environment. The sequence is illustrated as follows:

Fig 2: Illustration of imaginary steps and real steps in the augmented environment. Actions on imaginary steps are called imaginary actions, which determine the transition of the imaginary state. The R button in the middle corresponds to the reset action, which resets the next imaginary state to the current real state. After a fixed number of imaginary steps, there is a single real step and the agent selects a real action.

After the real step, another set of imaginary steps begins, and the process repeats until the episode ends. We are currently exploring more flexible ways of combining imaginary and real steps. For example, we may let the agent decide when to stop planning and start acting.

In the imaginary steps, the rewards are set to zero, while in the real steps, they are equal to the real rewards, so that the objectives in the real environment and the augmented environment are aligned. (We experimented with auxiliary rewards in imaginary steps to speed up learning, but later found that they do not affect performance.) The purpose of imaginary steps is to provide additional information for the agent to select better real actions. As such, it is recommended to use a memory-based agent, such as an LSTM (Long Short-Term Memory) actor-critic.
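The schedule of imaginary and real steps, the reset action, and the reward rule described above can be sketched with a toy wrapper. This is only an illustration under stated assumptions, not the released Thinker code: `ToyChain`, `perfect_model`, and `AugmentedEnv` are hypothetical names, the "world model" is a hand-written perfect model rather than a learned one, and the reset is simplified to replace the imaginary transition instead of following it.

```python
class ToyChain:
    """Hypothetical real environment: move right along a line until `goal`."""

    def __init__(self, goal=3):
        self.goal = goal
        self.pos = 0

    def step(self, action):
        # action 1 moves one cell to the right, action 0 stays in place
        self.pos += action
        done = self.pos >= self.goal
        return self.pos, (1.0 if done else 0.0), done


def perfect_model(state, action):
    """Stand-in for a learned world model (assumption: exact dynamics)."""
    return state + action


class AugmentedEnv:
    """k imaginary steps, then one real step, repeated until the episode ends."""

    def __init__(self, real_env, model, k=19):
        self.real_env = real_env
        self.model = model
        self.k = k
        self.real_state = real_env.pos
        self.imag_state = self.real_state  # planning starts from the real state
        self.t = 0                         # imaginary steps since the last real step

    def step(self, action, reset):
        if self.t < self.k:
            # Imaginary step: only the imagined state changes; reward is zero.
            self.t += 1
            if reset:
                self.imag_state = self.real_state
            else:
                self.imag_state = self.model(self.imag_state, action)
            return (self.real_state, self.imag_state), 0.0, False, "imaginary"
        # Real step: act in the real environment and restart planning from there.
        self.t = 0
        self.real_state, reward, done = self.real_env.step(action)
        self.imag_state = self.real_state
        return (self.real_state, self.imag_state), reward, done, "real"


def run_episode(k=19):
    """Always move right and never reset; log (step kind, reward) pairs."""
    env = AugmentedEnv(ToyChain(goal=3), perfect_model, k=k)
    log, done = [], False
    while not done:
        _, reward, done, kind = env.step(action=1, reset=False)
        log.append((kind, reward))
    return log


log = run_episode()
print(len(log))  # 60: three real steps, each preceded by 19 imaginary steps
```

Because every imaginary reward is zero, the return collected in this augmented episode equals the return of the underlying real episode, which is the alignment of objectives described above.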

Simplifying the Augmented Environment

In our experiments, we discovered that agents were unable to learn planning within this augmented environment. Specifically, trained agents tended to bypass the imaginary steps, focusing instead on directly acting during the real step. We hypothesize that this outcome results from the complexity inherent in the planning process. This complexity includes the agent’s need to independently learn various sub-tasks of planning, such as (i) searching, (ii) evaluating, (iii) summarizing, and (iv) executing, particularly when dealing with high-dimensional state spaces. For instance, in Sokoban, each state is represented by an 80x80 RGB pixel image, resulting in a state dimension of (80, 80, 3). With 19 imaginary steps, this leads to 20 distinct images (including the real step) to process in a single real step. Consequently, the dimension of the combined information in a real step becomes 20 x 80 x 80 x 3, equating to 384,000. In such a scenario, it becomes substantially simpler for the agent to disregard the imaginary states and focus solely on the real state, thereby effectively omitting the planning process and ignoring the world model.
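The dimension count above can be checked with a few lines of arithmetic. The variable names are illustrative only.

```python
import math

state_shape = (80, 80, 3)  # one Sokoban frame: 80x80 RGB pixels
imaginary_steps = 19       # imaginary steps before each real step

values_per_frame = math.prod(state_shape)
frames_per_real_step = imaginary_steps + 1  # 19 imagined frames plus the real one
total_values = frames_per_real_step * values_per_frame

print(values_per_frame)  # 19200
print(total_values)      # 384000
```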

To address this challenge, we provide additional information about the imaginary states in the augmented environment, thereby simplifying the planning process. Drawing inspiration from MCTS, we provide the following key pieces of information:

  1. The real policy \(\pi(\hat{s})\): This is the distribution of the real action over the state, trained through imitation learning based on real actions.
  2. Value \(v(\hat{s})\): This represents the expected sum of rewards for the state, trained using Temporal Difference (TD) learning on real steps.
  3. Predicted reward \(\hat{r}\): This is the predicted reward upon reaching the state.
  4. Additional hints: These may include statistics such as visit counts and average returns of plans. We refer to these as 'hints', as a memory-based agent should theoretically be able to calculate these statistics on its own from the other inputs provided.

We refer to the representation created by combining this information as the tree representation, as it effectively condenses the information from a tree-like structure. The tree representation usually encompasses fewer than 100 dimensions, making it significantly easier to learn to plan from than the high-dimensional predicted states. In both real and imaginary steps, this tree representation, together with the original states (imaginary or real), is fed to the agent. We discovered that the tree representation is a critical component in learning to plan. Agents mainly rely on this tree representation, and their performance remains the same even when we exclude both imaginary and real states, providing only the tree representation. We have named the augmented environment the Thinker-augmented environment, reflecting its design, which enables the agent to ‘think’ carefully before executing each real action. It is important to note that Thinker is an augmentation of the environment itself and not an RL algorithm. Consequently, any RL algorithm can be applied within the Thinker-augmented environment.
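To give a feel for how compact such a representation is, the sketch below concatenates the four kinds of information listed above into one flat vector. The function name, argument names, and layout are assumptions made for illustration; the actual tree representation in the Thinker code contains more entries and is organized differently.

```python
import numpy as np


def tree_representation(policy, value, predicted_reward, visit_counts, mean_returns):
    """Concatenate per-state statistics into one flat vector (hypothetical layout)."""
    parts = [policy, [value], [predicted_reward], visit_counts, mean_returns]
    return np.concatenate([np.asarray(p, dtype=np.float32).ravel() for p in parts])


num_actions = 5  # Sokoban has 5 possible actions
rep = tree_representation(
    policy=np.full(num_actions, 1.0 / num_actions),  # real policy over actions
    value=0.3,                                       # value estimate of the state
    predicted_reward=0.0,                            # reward predicted by the model
    visit_counts=np.zeros(num_actions),              # hint: visits per action
    mean_returns=np.zeros(num_actions),              # hint: average return per action
)
print(rep.shape)  # (17,)
```

Even with extra hints added, a vector of this kind stays orders of magnitude smaller than a stack of predicted image frames, which is the point of providing it to the agent.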

Relationship with Dual-process Theory

In psychology, dual-process theory distinguishes between two types of cognitive processes: System-1 and System-2. System-1 is fast, automatic, and often subconscious, akin to an instinctual response, while System-2 is slower, more deliberate, and logical.

In this context, the real policy in our Thinker-augmented environment can be compared to System-1. It provides a distilled, intuitive guide for immediate actions, similar to our instinctual or gut responses. However, this real policy might not always be the most effective option. Through a more thoughtful planning process, mirroring the attributes of System-2, we often develop a superior policy compared to our initial, instinctive reactions. An illustrative example is in mathematical proof derivation: our initial intuitions can sometimes lead us astray. Yet, these intuitive leaps are crucial for beginning and navigating complex problems. Therefore, the real policy’s role in the Thinker-augmented environment is vital, serving as a foundation for more intricate decision-making processes.

Experiment Results

We conducted experiments to assess the effectiveness of the Thinker-augmented environment, using a standard actor-critic algorithm on Thinker-augmented Sokoban and Atari. Here are the summarized findings:

Sokoban: An actor-critic algorithm applied to Thinker-augmented Sokoban surpassed other state-of-the-art RL algorithms, including Dreamer-v3 and MuZero. The evaluation focused on the solving rate of levels within 50 million frames. However, we also included baseline results for a higher number of frames, as Sokoban typically requires over 300 million frames to achieve good performance.

Fig 3: Running average solving rate over the last 200 episodes in Sokoban, compared with other baselines, over 50 million frames (left) and 500 million frames (right).

Atari: We also compared the performance of an actor-critic algorithm on Thinker-augmented Atari and on the raw Atari environment. The results showed a significant improvement in performance in the augmented environment. This suggests that the benefits of the augmented environment extend beyond tasks requiring extensive planning. While Thinker's performance on Atari was impressive, it still lagged behind MuZero. This difference is likely due to the fewer frames used in our tests (200 million compared to MuZero's 20 billion). We hypothesize that increasing the replay ratio, similar to MuZero Reanalyze, could enhance performance in limited frame settings.

Fig 4: Running average solving rate over the last 200 episodes in Atari, compared with other baselines.

Agent Behavior Analysis

Beyond raw performance metrics, the agent’s planning behavior in the Thinker-augmented environment is also worth examining. We designed the learned world model to be easily visualizable, allowing us to observe the agent’s planning process in detail. For example, the following video illustrates how a trained agent plans and acts: the left frame displays the real state, while the right frame displays the imaginary state generated by a learned world model. We can observe what the agent is planning in the right frame, and what the agent is doing in the left frame. A yellow or red tint indicates that the imaginary state has been reset.

The agent’s planning behavior notably differs from traditional handcrafted planning algorithms. In games like Sokoban, the agent learns to reset upon encountering an unsolvable state and generally formulates a few long-term plans. This approach is arguably more akin to human planning, where we typically abandon a plan once it’s deemed unworkable and consider only a handful of plans, as opposed to generating and evaluating hundreds of plans concurrently as in MCTS. For a more in-depth analysis of the learned planning behaviour, please refer to Appendix F of our paper.


Our code, which is publicly available, facilitates the use of the Thinker-augmented environment with an interface compatible with OpenAI Gym:

import thinker
import numpy as np
env_n = 16 # batch size
env = thinker.make("Sokoban-v0", env_n=env_n, gpu=False) # or atari games like "BreakoutNoFrameskip-v4"
initial_state = env.reset()
for _ in range(20):
    primary_action = np.random.randint(5, size=env_n) # 5 possible actions in Sokoban
    reset_action = np.random.randint(2, size=env_n) # 2 possible reset actions
    state, reward, done, info = env.step(primary_action, reset_action) 
print(state["tree_reps"].shape) # should be torch.Size([16, 79])

The state is structured as a dictionary encompassing the tree representation, imaginary state, real state, and the hidden state of the world model. A notable feature is that the learning process for the world model is integrated within the env.step function. This integration means that users do not need to write separate code for managing the world model.

Additionally, the Thinker environment offers a high degree of configurability. For example, users can easily adjust parameters like the number of imaginary steps and the learning rate of the world model to suit their specific needs. Comprehensive documentation detailing these features and more is available in our GitHub repository.


Thinker is the first work showing that an RL agent can learn to plan with a learned world model in complex environments. The history of machine learning research tells us that learned approaches often prevail over handcrafted ones. This transition becomes especially pronounced when large amounts of data and computational power are available. In the same vein, we surmise that learned planning algorithms will eventually surpass handcrafted planning algorithms.

We are venturing into several exciting avenues with this project and welcome collaboration. If you’re interested in this project, please contact me to discuss potential collaboration!