Learning to plan and act by augmenting the environment
Planning is a crucial component of intelligence, equipping us with the ability to navigate through unanticipated and complex scenarios. To delve into the mechanics of planning, consider the game Sokoban as an example:
In Sokoban, the objective is to push all four boxes to the target area outlined in red. You control a green character capable of moving in four directions: up, down, left, and right. By approaching a box, you can push it. Although it appears straightforward, Sokoban demands significant planning, particularly since boxes cannot be pulled. For instance, pushing a box into a corner renders it immovable, creating a deadlock.
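The corner deadlock described above is simple enough to check programmatically. The following is an illustrative sketch (not part of the Thinker codebase); the grid encoding and helper name are assumptions for this example:

```python
# Hypothetical helper illustrating the corner-deadlock idea: a box with a wall
# above or below AND a wall to its left or right can never be pushed out again,
# so the level becomes unsolvable (unless the box already sits on a target).

def is_corner_deadlock(walls, box, targets):
    """walls: set of (row, col) wall cells; box: (row, col); targets: set of goal cells."""
    r, c = box
    if box in targets:  # a box resting on a target is fine
        return False
    vertical = (r - 1, c) in walls or (r + 1, c) in walls
    horizontal = (r, c - 1) in walls or (r, c + 1) in walls
    return vertical and horizontal

walls = {(0, 0), (0, 1), (1, 0)}  # a top-left corner
print(is_corner_deadlock(walls, (1, 1), targets=set()))  # True: the box is stuck
```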
Now, let’s try solving this Sokoban level. What strategy would you employ?
Likely, you formulated several plans, stopped thinking once you found a solution, and then selected the best move. This exercise involves two distinct processes: planning and acting. Planning refers to the general process of interacting with an imaginary environment, where we may think through different plans and imagine their results. Acting, on the other hand, refers to executing an action in the real environment.
In this context, planning and acting are interrelated, as both involve decision-making in the game. During planning, we choose an action in the imagined game, whereas in acting, we choose an action in the real game. Planning can thus be perceived as a special type of action within the brain, a type of action known in neuroscience as internalized action.
Existing AI planning algorithms, such as A* and the Monte-Carlo Tree Search (MCTS) used in AlphaGo, are handcrafted rather than learned.
Given that planning and acting share mechanisms, it’s conceivable to apply the same learning algorithms to both processes. Reinforcement learning, primarily focused on teaching agents to act, might also teach them to plan. Therefore, we propose an augmented environment that consists of both the imaginary and real environments. By applying RL algorithms to this augmented environment, we aim for the RL agent to learn planning through imaginary actions in the imaginary environment. The goal is that these learned imaginary actions and results from the imaginary environment may help the agent select better real actions and achieve better performance in the real environment.
To construct the augmented environment (where environment refers to a Markov decision process), we utilize both an imaginary and a real environment. The imaginary environment is a learned or approximated version of the real one and is commonly known as a world model or simply model in RL.
During an imaginary step, the agent selects an imaginary action, and the world model predicts the subsequent state (and reward) based on the last imaginary state. The initial imaginary state is set to the current real state, ensuring that planning starts from the real state. Additionally, the agent can choose a special reset action to revert the imaginary state back to the current real state. This reset mechanism allows the agent to consider multiple plans, rather than being confined to a single, extended plan. For instance, if the initial plan proves ineffective, the agent can reset and propose a new plan from the beginning.
After a set number of imaginary steps, there is a single real step. In this step, the agent selects and executes a real action in the real environment. The sequence is illustrated as follows:
After the real step, another set of imaginary steps begins, and the process repeats until the episode ends.
In the imaginary steps, the rewards are set to zero, while in the real steps, they are equal to the real rewards.
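The step logic described above can be sketched as a simple wrapper. This is a minimal illustration only, assuming a world model with a `predict(state, action) -> (next_state, reward)` interface; all names here are placeholders, not the actual Thinker implementation:

```python
# Sketch of the augmented environment's step cycle: n_imaginary imaginary steps
# (queried against a learned world model, with zero reward and an optional reset
# back to the real state), followed by one real step in the real environment.

class AugmentedEnv:
    def __init__(self, real_env, world_model, n_imaginary=19):
        self.real_env = real_env
        self.model = world_model
        self.n_imaginary = n_imaginary

    def reset(self):
        self.real_state = self.real_env.reset()
        self.imag_state = self.real_state  # planning starts from the real state
        self.t = 0                         # position within one imaginary/real cycle
        return self.real_state

    def step(self, action, reset=False):
        if self.t < self.n_imaginary:      # imaginary step: query the world model
            self.t += 1
            if reset:                      # revert to the current real state
                self.imag_state = self.real_state
                return self.imag_state, 0.0, False
            self.imag_state, _ = self.model.predict(self.imag_state, action)
            return self.imag_state, 0.0, False  # imaginary rewards are zero
        # real step: act in the real environment and restart the cycle
        self.t = 0
        self.real_state, reward, done = self.real_env.step(action)
        self.imag_state = self.real_state
        return self.real_state, reward, done
```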
In our experiments, we discovered that agents were unable to learn planning within this augmented environment. Specifically, trained agents tended to bypass the imaginary steps, focusing instead on directly acting during the real step. We hypothesize that this outcome results from the complexity inherent in the planning process. This complexity includes the agent’s need to independently learn various sub-tasks of planning, such as (i) searching, (ii) evaluating, (iii) summarizing, and (iv) executing, particularly when dealing with high-dimensional state spaces. For instance, in Sokoban, each state is represented by an 80x80 RGB pixel image, resulting in a state dimension of (80, 80, 3). With 19 imaginary steps, this leads to 20 distinct images (including the real step) to process in a single real step. Consequently, the dimension of the combined information in a real step becomes 20 x 80 x 80 x 3, equating to 384,000. In such a scenario, it becomes substantially simpler for the agent to disregard the imaginary states and focus solely on the real state, thereby effectively omitting the planning process and ignoring the world model.
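The dimension count above can be verified directly:

```python
import numpy as np

# The combined observation over one real step: 19 imaginary frames plus
# 1 real frame, each an 80x80 RGB image.
frames = np.zeros((20, 80, 80, 3), dtype=np.uint8)
print(frames.size)  # 384000 values per real step
```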
To address this challenge, we provide additional information about the imaginary states in the augmented environment, thereby simplifying the planning process. Drawing inspiration from MCTS, we provide the following key pieces of information:
We refer to the representation created by combining this information as the tree representation, as it effectively condenses the information from a tree-like structure. The tree representation usually encompasses fewer than 100 dimensions, making it significantly easier for learning to plan compared to the high-dimensional predicted states. In both real and imaginary steps, this tree representation, together with the original states (imaginary or real), is fed to the agent.
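To make the idea concrete, the following sketch condenses planning statistics into a single low-dimensional vector. The exact contents of Thinker's tree representation are given in the paper; the components below are hypothetical stand-ins chosen for illustration:

```python
import numpy as np

# Illustrative only: combine a handful of planning statistics (hypothetical
# components) into one compact vector, in contrast to stacks of raw images.
def build_tree_rep(root_value, rollout_rewards, rollout_values, action_probs):
    """Condense planning statistics into one low-dimensional vector."""
    return np.concatenate([
        [root_value],        # value estimate at the current real state
        rollout_rewards,     # predicted rewards along the current rollout
        rollout_values,      # predicted values along the current rollout
        action_probs,        # policy over the 5 Sokoban actions
    ]).astype(np.float32)

rep = build_tree_rep(0.7, np.zeros(5), np.zeros(5), np.full(5, 0.2))
print(rep.shape)  # (16,) -- far smaller than a stack of 80x80x3 images
```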
In psychology, dual-process theory distinguishes between two modes of thought: System-1, which is fast, automatic, and intuitive, and System-2, which is slow, effortful, and deliberate.
In this context, the real policy in our Thinker-augmented environment can be compared to System-1. It provides a distilled, intuitive guide for immediate actions, similar to our instinctual or gut responses. However, this real policy might not always be the most effective option. Through a more thoughtful planning process, mirroring the attributes of System-2, we often develop a superior policy compared to our initial, instinctive reactions. An illustrative example is in mathematical proof derivation: our initial intuitions can sometimes lead us astray. Yet, these intuitive leaps are crucial for beginning and navigating complex problems. Therefore, the real policy’s role in the Thinker-augmented environment is vital, serving as a foundation for more intricate decision-making processes.
We conducted experiments to assess the effectiveness of the Thinker-augmented environment, using a standard actor-critic algorithm on Thinker-augmented Sokoban and Atari. Here are the summarized findings:
Sokoban: The actor-critic algorithm applied to Thinker-augmented Sokoban surpassed other state-of-the-art RL algorithms, including Dreamer-v3.
Atari: We also compared the performance of the actor-critic algorithm on Thinker-augmented Atari and on the raw Atari environment. The results showed a significant improvement in performance in the augmented environment, suggesting that its benefits extend beyond tasks requiring extensive planning.
Beyond raw performance metrics, the agent’s planning behavior in the Thinker-augmented environment is also very interesting. We designed the learned world model to be easily visualizable, allowing us to observe the agent’s planning process in detail. For example, the following video illustrates how a trained agent plans and acts: the left frame displays the real state, while the right frame displays the imaginary state generated by the learned world model. We can observe what the agent is planning in the right frame, and what the agent is doing in the left frame. A yellow or red tint indicates that the imaginary state has been reset.
The agent’s planning behavior notably differs from traditional handcrafted planning algorithms. In games like Sokoban, the agent learns to reset upon encountering an unsolvable state and generally formulates a few long-term plans. This approach is arguably more akin to human planning: we typically abandon a plan once it’s deemed unworkable and consider only a handful of plans, as opposed to generating and evaluating hundreds of plans concurrently, as MCTS does. For a more in-depth analysis of the learned planning behaviour, please refer to Appendix F of our paper.
Our code, which is publicly available, facilitates the use of the Thinker-augmented environment with an interface compatible with OpenAI Gym:
import thinker
import numpy as np

env_n = 16  # batch size
env = thinker.make("Sokoban-v0", env_n=env_n, gpu=False)  # or atari games like "BreakoutNoFrameskip-v4"
initial_state = env.reset()
for _ in range(20):
    primary_action = np.random.randint(5, size=env_n)  # 5 possible actions in Sokoban
    reset_action = np.random.randint(2, size=env_n)    # 2 possible reset actions
    state, reward, done, info = env.step(primary_action, reset_action)
print(state["tree_reps"].shape)  # should be torch.Size([16, 79])
state is structured as a dictionary encompassing the tree representation, imaginary state, real state, and the hidden state of the world model. A notable feature is that the learning process for the world model is integrated within the env.step function. This integration means that users do not need to write separate code for managing the world model.
Additionally, the Thinker environment offers a high degree of configurability. For example, users can easily adjust parameters like the number of imaginary steps and the learning rate of the world model to suit their specific needs. Comprehensive documentation detailing these features and more is available in our Github repository.
Thinker is the first work showing that an RL agent can learn to plan with a learned world model in complex environments. The history of machine learning research tells us that learned approaches often prevail over handcrafted ones, especially when large amounts of data and computational power are available. In the same vein, we surmise that learned planning algorithms will eventually surpass handcrafted ones.
We are venturing into several exciting avenues with this project and welcome collaboration. If you’re interested in this project, please contact me to discuss potential collaboration!