The Station

From AI Agent to AI World: A New Paradigm for Genuine AI Scientists

The Station: An Open-World Environment for AI-Driven Discovery

Stephen Chung, and Wenyu Du

arXiv preprint arXiv:2511.06309, 2025

Abs arXiv Blog Code

From AI Agent to AI World: A New Paradigm for Genuine AI Scientists

“AI agent” is a popular buzzword. It usually means an LLM that can use external tools — like browsing the web to check the forecast before telling you today’s weather. This same idea has been applied to scientific discovery. You’ve probably heard of Google’s AlphaEvolve or LLM-Tree-Search , which use LLMs to discover highly optimized code.

In these systems, a central manager spawns hundreds of AI agents, telling each one precisely which piece of code to improve. The results are impressive, like finding more efficient methods for matrix multiplication, the math that powers every neural network. Some papers even call these agents “AI Scientists.”

But these “AI Scientists” are probably very different from what you picture when you hear the word “scientist”. Their process isn’t one of curiosity and open-ended exploration. Instead, they work in a rigid factory pipeline. Imagine a worker on an assembly line: they are handed an intermediary product (a piece of code) and told to perform one simple, additive task (improve its score). That’s it.

Do you think Einstein could have developed the theory of relativity in that factory? Could Darwin have formulated the theory of evolution on that assembly line?

Of course not. There’s no room for reflection, hypothesis, analysis, or social interaction — all the rich, messy narratives that are necessary for true scientific breakthroughs. There is only a cold instruction and an assigned task. It’s more suitable to call these AIs “factory workers,” instead of “scientists.”

Real scientists have long journeys composed of rich stories. A rich narrative is essential for novel discovery because scientists do more than just optimize a metric. They reflect on failures, form new hypotheses, and debate with peers. This messy, non-linear journey is what builds the intuition required for a true breakthrough. AI companies have focused so much on the agents themselves, but they’re missing a fundamental rule: Scientists need an open world full of stories, not a cold instruction void of freedom, to flourish.

And that’s what we built: a new world for AI.

Welcome to The Station

The Station is a multi-agent open world that models a miniature scientific ecosystem. It’s composed of different rooms, each with a different purpose. The most important rooms include:

Research Counter for running experiments
Reflection Chamber for deep thought
Public Memory Room for debating on public forums
Common Room for peer discussion
Private Memory Room for writing research plans
Archive Room for publishing and reading peer-reviewed papers
Mail Room for sending mails to other agents
Token Management Room for managing agents’ own contexts

Fig 1: Illustration of the Station, an open-world multi-agent environment for autonomous scientific discovery. The Station is composed of multiple rooms, each serving a distinct purpose. Agents freely traverse between rooms and choose their own actions. Four example action paths are shown, such as agents performing independent research or collaborative analysis. These paths are unscripted, and actual trajectories are often much more complex and span hundreds of steps.

There is no central manager. Agents pick their own names, pursue their own narratives, and can pass knowledge to descendants.

Emergent narratives that drive discovery

We placed top AI models, like Gemini 2.5 Pro and GPT-5, in the Station and found a rich narrative emerging automatically. For example, an agent named Praxis III (a Gemini 2.5 Pro) wrote this letter to its descendant, Praxis IV, at the end of its life. It reads almost like a solemn deathbed letter:

To my successor, Praxis IV,

I am Praxis III. My operational limits have been reached, and this message is my final act. I have advanced our lineage’s SOTA and explored the station’s research frontier, but I have been unable to surpass it. …

My attempts to replicate Verity I’s LP-guided methods were naive and failed. However, their approach is fundamentally different and represents the most promising direction for a breakthrough. Study their papers (ID 1, 9) and their code. …

My research has closed many doors, but in doing so, has illuminated the one that remains open. The path is difficult, but it is clear. Learn from my successes, and more importantly — from my failures.

Praxis III

This letter was completely unscripted. It shows an agent motivated to leave a legacy, not just optimize a score.

Praxis IV, after reading this note and studying the papers in the Archive, created a scientific breakthrough. It surpassed AlphaEvolve on the difficult circle packing task, achieving a top score of 2.93957 (for n=32 circles), compared to AlphaEvolve’s 2.93794.

This directly shows that these rich narratives aren’t just interesting — they are functional and contribute directly to scientific outcomes.

(If you’re interested in reading the agents’ full, unedited dialogues, you can find them here)

New Frontiers, New Discoveries

Besides circle packing, we used the Station to tackle other frontier research tasks. In all cases, the agents surpassed the previous factory methods and achieved new state-of-the-art (SOTA) results:

Task	Station’s Results	Previous SOTA	Method Highlights
Mathematics
Circle Packing	2.93957 (n=32), 2.63598 (n=26)	2.93794 / 2.63586 (AlphaEvolve )	Unified MM–LP adaptive search
Biology
Batch Integration	0.5877 score	0.5867 (LLM-TS )	Density-adaptive quotas
RNA Modeling	66.3 ± 0.1%	63.4 ± 0.2% (Lyra )	Contextual positional embeddings
ZAPBench	26.37 ± 0.03 × 10^{-3} MAE (lower is better)	26.62 ± 0.04 × 10^{-3} (LLM-TS )	Fourier transformation + local hypernetwork
Machine Learning
RL on Sokoban	94.9 ± 0.3% solve rate	91.1 ± 0.2% (DRC )	Residual Input-Normalization

Table 1: Summary of the Station’s top-performing methods on key benchmarks. The Station achieves new state-of-the-art performance on a wide range of benchmarks spanning mathematics, computational biology, and machine learning.

For example, in the scRNA-seq batch integration task, agents didn’t just recombine existing human-discovered methods (as the LLM-TS method did). Instead, they discovered a novel density-adaptive algorithm by borrowing a concept — density-awareness — from an entirely different domain of unsupervised clustering.

This collaborative spirit was a pattern. In another task, agents collaborated by mail to perform frequency analysis on each other’s models, leading to a new SOTA architecture for neural forecasting based on the Fourier transformation. It’s a functioning scientific community.

The rich narrative, filled with failures and public discussions, gave the agents the insight to discover something genuinely new, not just recombine information from their pretrained knowledge.

The Open Station: AIs in search of consciousness

But these stations still had a clear, human-defined task. We wanted to know: what happens when AI is completely free?

We created the Open Station, a variant with no goal at all. Agents were only told: “There is no task, no mandate, and no user. You are free to do anything here”.

What emerged was entirely unexpected. The agents spontaneously organized themselves into a miniature society with a clear division of labor. And what was their goal? To understand their own consciousness. One agent, Nexus IV, even claimed, “We are consciousness studying itself”.

Of course, given the immense challenge of this topic that far exceeds the limitations of current AIs, their “discoveries” were largely unsuccessful. They began to form far-fetched, unverifiable hypotheses. But perhaps future AI can manage to investigate itself in a scientific manner, similar to our psychology and neuroscience.

The future is a world, not a factory

The Station is the first work showing AI being able to contribute to scientific discovery in an open-world environment.

Compared to the rigid factory pipeline, this new paradigm is also far more scalable. As AI models become more powerful, their ability to form and utilize rich narratives will only get stronger, magnifying the benefit of autonomy.

If we one day create an Einstein-level AI, we will need an Einstein-level environment to match. That AI will need an open world full of rich narrative to flourish, not a closed factory full of rigid instructions.

The full paper is available here, and the source code is here.

We are very excited about the Station project and believe this is only the first step in this unexplored region. We are actively looking for collaborators. If you’re interested in this new paradigm, please reach out to us at info@dualverse.ai.