Interactive world models continually generate video in response to a user's actions, enabling open-ended generation. However, existing models typically lack a 3D representation of the environment: 3D consistency must be learned implicitly from data, and spatial memory is restricted to a limited temporal context window. This results in an unrealistic user experience and presents significant obstacles to downstream tasks such as training agents.
To address this, we present PERSIST, a new paradigm of world model that simulates the evolution of a latent 3D scene: environment, camera, and renderer. This allows us to synthesise new frames with persistent spatial memory and consistent geometry. Both quantitative metrics and a qualitative user study show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods, enabling coherent, evolving 3D worlds. We further demonstrate novel capabilities, including synthesising diverse 3D environments from a single image, and fine-grained, geometry-aware control over generated experiences through environment editing and specification directly in 3D space.
Within 3D environments, pixel observations provide partial, highly redundant, and viewpoint-dependent snapshots of the world. Rather than relying on pixels alone, could a world model maintain a persistent 3D representation of the environment that evolves over time? To investigate this question, we study world model learning in a Minecraft-inspired voxel environment. Voxel worlds discretise space into interpretable building blocks and support rich player–environment interactions, providing a natural testbed for evaluating how explicit 3D representations influence world model simulation.
Pixel-based memory is expensive, limiting existing world models to a few seconds of visual memory. By maintaining a persistent 3D state, PERSIST enables efficient retrieval of spatial information over extended horizons.
Because they provide only a partial view of the environment, pixel-based histories make it difficult to model interactions with out-of-view objects, and we find that existing models tend to ignore such interactions. In contrast, our 3D representation tracks the space all around the player, allowing PERSIST to model a broader class of agent-environment interactions.
Key-frame retrieval methods construct their context by selecting spatially and temporally relevant frames from the full history. However, retrieval usually necessitates warm-starting the generation pipeline with hundreds of ground truth pixel observations, and relies on these observations as a proxy for spatial memory. Exploring new regions or revisiting a scene from a new viewpoint often causes an immediate degradation in visual quality and spatial consistency.
PERSIST takes a fundamentally different approach. Rather than retrieving spatial information from past frames, it maintains a 3D representation that is actively regenerated at each timestep. As the agent moves, this representation evolves to reflect the current state of the world, enabling consistent spatial reasoning without relying on an ever-growing archive of past observations.
Initialised with a single pixel frame, PERSIST evolves in an auto-regressive loop in response to user actions. We first denoise the 3D environment centred on the agent in the form of a latent world frame. Next, camera parameters are predicted with a feed-forward transformer. We then project the world onto the camera plane to form a depth-ordered stack of world latents. Finally, pixel latents are denoised, using pixel-aligned 3D information from the world-latent stack as guidance.
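The four steps of this loop can be sketched as follows. This is a minimal structural sketch, not the paper's implementation: the tensor shapes, the `denoise_world_frame`, `predict_camera`, `project_to_camera`, and `denoise_pixels` functions, and their stand-in bodies are all assumptions standing in for the actual learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_world_frame(prev_world, action):
    """Stand-in for the flow-matching model that denoises a latent
    world frame centred on the agent (here: random latents, kept fixed)."""
    if prev_world is None:
        return rng.normal(size=(16, 16, 16, 8))  # (X, Y, Z, C) latents
    return prev_world

def predict_camera(world, action):
    """Stand-in for the feed-forward camera transformer: returns a
    position and an orientation quaternion."""
    return np.zeros(3), np.array([1.0, 0.0, 0.0, 0.0])

def project_to_camera(world, camera, num_layers=4):
    """Stand-in for projection onto the camera plane: a depth-ordered
    stack of world latents, one (H, W, C) layer per depth bin."""
    return np.stack([world[:, :, d, :] for d in range(num_layers)])

def denoise_pixels(latent_stack, action):
    """Stand-in for pixel-latent denoising guided by the pixel-aligned
    world-latent stack (here: a simple average over depth)."""
    return latent_stack.mean(axis=0)

def rollout(actions):
    world, frames = None, []
    for a in actions:
        world = denoise_world_frame(world, a)     # 1. latent world frame
        camera = predict_camera(world, a)         # 2. camera parameters
        stack = project_to_camera(world, camera)  # 3. depth-ordered stack
        frames.append(denoise_pixels(stack, a))   # 4. pixel latents
    return frames

frames = rollout(["forward", "turn_left", "forward"])
```

The key design point is that the persistent state carried across iterations is the latent world frame, not a history of past pixel frames.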
Beyond improving the quality and coherence of generated experiences, we find that PERSIST's 3D representation confers a number of new capabilities.
As a generative approach, our world frame flow matching model can produce diverse plausible world states from a single RGB observation at initialisation.
Below, we visualise the iterative denoising of the world frame over 20 denoising steps. Each row corresponds to a specific input RGB frame, while each column shows a distinct sample generated with a different random seed.
The generated world frames capture the structure and semantics of the original view, while outpainting unseen regions in different plausible ways. This ability allows PERSIST to generate diverse and coherent environments, supporting a wide variety of interactive experiences.
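Flow-matching sampling of this kind can be sketched as Euler integration of a learned velocity field from noise to data. The real conditional velocity network is not specified here, so `velocity` below is a toy stand-in; only the sampling loop structure (20 Euler steps, a different sample per seed) reflects the description above.

```python
import numpy as np

def velocity(x, t, rgb_cond):
    """Toy stand-in for the learned velocity field v(x, t | rgb):
    a field whose flow drifts the sample toward the conditioning."""
    return rgb_cond - x

def sample_world_frame(rgb_cond, steps=20, seed=0):
    """Euler integration of the flow ODE from t=0 (Gaussian noise)
    to t=1 (sample), mirroring the 20-step denoising schedule."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=rgb_cond.shape)  # start from pure noise
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt, rgb_cond)
    return x

cond = np.ones((4, 4))                     # stand-in for RGB conditioning
a = sample_world_frame(cond, seed=0)
b = sample_world_frame(cond, seed=1)       # new seed, distinct sample
```

Different seeds give different initial noise and hence different plausible world frames for the same conditioning, which is the source of the sample diversity shown in each column.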
Alternatively, we can exploit the explicit 3D representation to provide the model with 3D data directly. This explicit 3D conditioning affords a greater degree of control over the generated experience than an image does, as the agent's full surroundings can be specified. Here we show how closely PERSIST follows ground truth rollouts when it receives a ground truth world frame at initialisation.
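Specifying the agent's full surroundings in 3D could look like the sketch below. The block vocabulary, grid size, and one-hot encoding are illustrative assumptions; the paper's actual world-frame format is not given here.

```python
import numpy as np

# Hypothetical block vocabulary for a voxel world.
AIR, GRASS, STONE, WATER = 0, 1, 2, 3
NUM_BLOCKS = 4

def make_ground_truth_world_frame(size=16):
    """Specify the surroundings directly in 3D: a grass plain over
    stone, with a small pond carved into the surface."""
    blocks = np.full((size, size, size), AIR, dtype=np.int64)
    blocks[:, :, : size // 2] = STONE     # subsurface layer
    blocks[:, :, size // 2] = GRASS       # surface layer
    blocks[4:8, 4:8, size // 2] = WATER   # pond
    return blocks

def to_latent(blocks, num_classes=NUM_BLOCKS):
    """One-hot encode block IDs into an (X, Y, Z, C) tensor, one
    plausible way to condition the model on an explicit world frame."""
    return np.eye(num_classes)[blocks]

frame = make_ground_truth_world_frame()
latent = to_latent(frame)
```

Because every voxel around the agent is specified, this form of conditioning pins down the environment far more tightly than a single image, which only constrains what is in view.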
Our explicit 3D representation allows us to directly edit the world state mid-generation, enabling a new form of dynamic world editing that occurs directly in 3D space. We provide some examples below.
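A mid-generation edit amounts to mutating the world state before it re-enters the auto-regressive loop. The sketch below is an assumed interface (the edit format and block IDs are hypothetical), showing only the core idea: the edit is applied to the 3D state itself, not to pixels.

```python
import numpy as np

def edit_world_frame(blocks, region, block_id):
    """Apply a 3D edit to the current world state: set every voxel in
    `region` (a NumPy index expression) to `block_id`. The edited frame
    then replaces the denoised one in the generation loop."""
    out = blocks.copy()  # leave the original state untouched
    out[region] = block_id
    return out

TOWER = 5  # hypothetical block ID
world = np.zeros((16, 16, 16), dtype=np.int64)
# Raise a six-voxel tower at column (8, 8) mid-generation.
edited = edit_world_frame(world, np.s_[8, 8, 0:6], TOWER)
```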
Like other auto-regressive models, PERSIST experiences visual artifacts induced by auto-regressive drift. However, we find that our 3D representation is inherently more stable and resistant to drift: visual artifacts tend to recede when the model retrieves information from its 3D representation (e.g. when rotating the camera), making it possible to maintain stability over extended horizons.
We find that PERSIST learns to model environmental processes that evolve on their own, without direct player input. The world continues to change even when not directly observed, sometimes giving rise to unexpected interactions with the player.
An important limitation of our approach is the need for 3D annotations during training. Future work could explore learning the 3D representation from partial or synthetic annotations, or in a fully self-supervised manner. Like other autoregressive models, PERSIST suffers from generation artifacts caused by error accumulation over long rollouts (autoregressive drift). This issue can be mitigated through post-training that makes the model robust to its own generations, which we did not perform here.
Despite this, we find that PERSIST's 3D representation often enables recovery from artifacts. While comparable models become unstable after a few hundred frames, PERSIST frequently maintains stability for thousands of frames, although generation artifacts become more common over time. The video below shows a 2000-frame (83-second) episode illustrating several such artifacts and recoveries.
@article{garcin2026pixelhistoriesworldmodels,
author = {Samuel Garcin and Thomas Walker and Steven McDonagh and Tim Pearce and Hakan Bilen and Tianyu He and Kaixin Wang and Jiang Bian},
journal = {ArXiv preprint},
title = {Beyond Pixel Histories: World Models with Persistent 3D State},
url = {https://arxiv.org/abs/2603.03482},
volume = {abs/2603.03482},
year = {2026}
}