Interactive world models continually generate video in response to a user's actions, enabling open-ended generation. However, existing models typically lack a 3D representation of the environment: 3D consistency must be learned implicitly from data, and spatial memory is restricted to a limited temporal context window. This results in an unrealistic user experience and presents significant obstacles to downstream tasks such as training agents.
To address this, we present PERSIST, a new paradigm of world model that simulates the evolution of a latent 3D scene: environment, camera, and renderer. This allows us to synthesise new frames with persistent spatial memory and consistent geometry. Both quantitative metrics and a qualitative user study show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods, enabling coherent, evolving 3D worlds. We further demonstrate novel capabilities, including synthesising diverse 3D environments from a single image, and fine-grained, geometry-aware control over generated experiences through environment editing and specification directly in 3D space.
Within 3D environments, pixel observations provide partial, highly redundant, and viewpoint-dependent snapshots of the world. Rather than relying on pixels alone, could a world model maintain a persistent 3D representation of the environment that evolves over time? To investigate this question, we study world model learning in a Minecraft-inspired voxel environment. Voxel worlds discretise space into interpretable building blocks and support rich player–environment interactions, providing a natural testbed for evaluating how explicit 3D representations influence world model simulation.
Pixel-based memory is expensive, limiting existing world models to a few seconds of visual memory. By maintaining a persistent 3D state, PERSIST enables efficient retrieval of spatial information over extended horizons.
Because they provide only a partial view of the environment, pixel-based histories make it difficult to model interactions with out-of-view objects, and we find that existing models tend to ignore such interactions. In contrast, our 3D representation tracks the space all around the player, allowing PERSIST to model a broader class of agent-environment interactions.
Key-frame retrieval methods construct their context by selecting spatially and temporally relevant frames from the full history. However, retrieval usually necessitates warm-starting the generation pipeline with hundreds of ground truth pixel observations, and relies on these observations as a proxy for spatial memory. Exploring new regions or revisiting a scene from a new viewpoint often causes an immediate degradation in visual quality and spatial consistency.
PERSIST takes a fundamentally different approach. Rather than retrieving spatial information from past frames, it maintains a 3D representation that is actively regenerated at each timestep. As the agent moves, this representation evolves to reflect the current state of the world, enabling consistent spatial reasoning without relying on an ever-growing archive of past observations.
Initialised with a single pixel frame, PERSIST evolves in an auto-regressive loop in response to user actions. We first denoise the 3D environment centred on the agent in the form of a latent world frame. Next, camera parameters are predicted with a feed-forward transformer. We then project the world onto the camera plane to form a depth-ordered stack of world latents. Finally, pixel latents are denoised, using pixel-aligned 3D information from the world-latent stack as guidance.
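The four steps of this loop can be sketched as follows. This is a minimal structural sketch, not the paper's implementation: the tensor shapes, the `denoise_world_frame`, `predict_camera`, `project_to_camera`, and `denoise_pixels` functions, and their stand-in bodies are all assumptions standing in for the actual learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_world_frame(prev_world, action):
    """Stand-in for the flow-matching model that denoises a latent
    world frame centred on the agent (here: random latents, kept fixed)."""
    if prev_world is None:
        return rng.normal(size=(16, 16, 16, 8))  # (X, Y, Z, C) latents
    return prev_world

def predict_camera(world, action):
    """Stand-in for the feed-forward camera transformer: returns a
    position and an orientation quaternion."""
    return np.zeros(3), np.array([1.0, 0.0, 0.0, 0.0])

def project_to_camera(world, camera, num_layers=4):
    """Stand-in for projection onto the camera plane: a depth-ordered
    stack of world latents, one (H, W, C) layer per depth bin."""
    return np.stack([world[:, :, d, :] for d in range(num_layers)])

def denoise_pixels(latent_stack, action):
    """Stand-in for pixel-latent denoising guided by the pixel-aligned
    world-latent stack (here: a simple average over depth)."""
    return latent_stack.mean(axis=0)

def rollout(actions):
    world, frames = None, []
    for a in actions:
        world = denoise_world_frame(world, a)     # 1. latent world frame
        camera = predict_camera(world, a)         # 2. camera parameters
        stack = project_to_camera(world, camera)  # 3. depth-ordered stack
        frames.append(denoise_pixels(stack, a))   # 4. pixel latents
    return frames

frames = rollout(["forward", "turn_left", "forward"])
```

The key design point is that the persistent state carried across iterations is the latent world frame, not a history of past pixel frames.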
Beyond improving the quality and coherence of generated experiences, we find that PERSIST's 3D representation confers a number of new capabilities.
As a generative approach, our world frame flow matching model can produce diverse plausible world states from a single RGB observation at initialisation.
Below, we visualise the iterative denoising of the world frame over 20 denoising steps. Each row corresponds to a specific input RGB frame, while each column shows a distinct sample generated with a different random seed.
The generated world frames capture the structure and semantics of the original view, while outpainting unseen regions in different plausible ways. This ability allows PERSIST to generate diverse and coherent environments, supporting a wide variety of interactive experiences.
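Flow-matching sampling of this kind can be sketched as Euler integration of a learned velocity field from noise to data. The real conditional velocity network is not specified here, so `velocity` below is a toy stand-in; only the sampling loop structure (20 Euler steps, a different sample per seed) reflects the description above.

```python
import numpy as np

def velocity(x, t, rgb_cond):
    """Toy stand-in for the learned velocity field v(x, t | rgb):
    a field whose flow drifts the sample toward the conditioning."""
    return rgb_cond - x

def sample_world_frame(rgb_cond, steps=20, seed=0):
    """Euler integration of the flow ODE from t=0 (Gaussian noise)
    to t=1 (sample), mirroring the 20-step denoising schedule."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=rgb_cond.shape)  # start from pure noise
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt, rgb_cond)
    return x

cond = np.ones((4, 4))                     # stand-in for RGB conditioning
a = sample_world_frame(cond, seed=0)
b = sample_world_frame(cond, seed=1)       # new seed, distinct sample
```

Different seeds give different initial noise and hence different plausible world frames for the same conditioning, which is the source of the sample diversity shown in each column.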
Alternatively, we can exploit the explicit 3D representation to provide the model with 3D data directly. This explicit 3D conditioning affords a greater degree of control over the generated experience than an image does, as the agent's full surroundings can be specified. Here we show how closely PERSIST follows ground truth rollouts when it receives a ground truth world frame at initialisation.
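Specifying the agent's full surroundings in 3D could look like the sketch below. The block vocabulary, grid size, and one-hot encoding are illustrative assumptions; the paper's actual world-frame format is not given here.

```python
import numpy as np

# Hypothetical block vocabulary for a voxel world.
AIR, GRASS, STONE, WATER = 0, 1, 2, 3
NUM_BLOCKS = 4

def make_ground_truth_world_frame(size=16):
    """Specify the surroundings directly in 3D: a grass plain over
    stone, with a small pond carved into the surface."""
    blocks = np.full((size, size, size), AIR, dtype=np.int64)
    blocks[:, :, : size // 2] = STONE     # subsurface layer
    blocks[:, :, size // 2] = GRASS       # surface layer
    blocks[4:8, 4:8, size // 2] = WATER   # pond
    return blocks

def to_latent(blocks, num_classes=NUM_BLOCKS):
    """One-hot encode block IDs into an (X, Y, Z, C) tensor, one
    plausible way to condition the model on an explicit world frame."""
    return np.eye(num_classes)[blocks]

frame = make_ground_truth_world_frame()
latent = to_latent(frame)
```

Because every voxel around the agent is specified, this form of conditioning pins down the environment far more tightly than a single image, which only constrains what is in view.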
Our explicit 3D representation allows us to directly edit the world state mid-generation, enabling a new form of dynamic world editing that occurs directly in 3D space. We provide some examples below.
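A mid-generation edit amounts to mutating the world state before it re-enters the auto-regressive loop. The sketch below is an assumed interface (the edit format and block IDs are hypothetical), showing only the core idea: the edit is applied to the 3D state itself, not to pixels.

```python
import numpy as np

def edit_world_frame(blocks, region, block_id):
    """Apply a 3D edit to the current world state: set every voxel in
    `region` (a NumPy index expression) to `block_id`. The edited frame
    then replaces the denoised one in the generation loop."""
    out = blocks.copy()  # leave the original state untouched
    out[region] = block_id
    return out

TOWER = 5  # hypothetical block ID
world = np.zeros((16, 16, 16), dtype=np.int64)
# Raise a six-voxel tower at column (8, 8) mid-generation.
edited = edit_world_frame(world, np.s_[8, 8, 0:6], TOWER)
```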
Like other auto-regressive models, PERSIST experiences visual artifacts induced by auto-regressive drift. However, we find that our 3D representation is inherently more stable and resistant to drift: visual artifacts tend to recede when the model retrieves information from its 3D representation (e.g. when rotating the camera), making it possible to maintain stability over extended horizons.
We find that PERSIST learns to model environmental processes that evolve on their own, without direct player input. The world continues to change even when not directly observed, sometimes giving rise to unexpected interactions with the player.
An important limitation of our approach is the need for 3D annotations during training. Future work could explore learning the 3D representation from partial or synthetic annotations, or in a fully self-supervised manner. Like other autoregressive models, PERSIST suffers from generation artifacts caused by error accumulation over long rollouts (autoregressive drift). This issue can be mitigated through post-training that makes the model robust to its own generations, which we did not perform here.
Despite this, we find that PERSIST's 3D representation often enables recovery from artifacts. While comparable models become unstable after a few hundred frames, PERSIST frequently maintains stability for thousands of frames, although generation artifacts become more common over time. The video below shows a 2000-frame (83-second) episode illustrating several such artifacts and recoveries.
@article{garcin2026pixelhistoriesworldmodels,
author = {Samuel Garcin and Thomas Walker and Steven McDonagh and Tim Pearce and Hakan Bilen and Tianyu He and Kaixin Wang and Jiang Bian},
journal = {ArXiv preprint},
title = {Beyond Pixel Histories: World Models with Persistent 3D State},
url = {https://arxiv.org/abs/2603.03482},
volume = {abs/2603.03482},
year = {2026}
}