🌍 Any 3D Scene is Worth 1K Tokens:
3D-Grounded Representation for
Scene Generation at Scale

1Westlake University, 2Afari Intelligent Drive, 3Zhejiang University
*: Equal Contribution, ✝: Project Lead, ✉️: Corresponding Author

3DRAE grounds multi-view observations of 3D scenes into fixed-length (e.g., 256, 512, or 1024) 3D latent tokens and decodes them back into view-based images and point maps. 3DDiT performs diffusion modeling directly within this 3D latent space, generating a 3D scene in 2 seconds.

Abstract

3D scene generation has long been dominated by 2D multi-view or video diffusion models. This is due not only to the lack of a scene-level 3D latent representation, but also to the fact that most scene-level 3D visual data exists as multi-view images or videos, which are naturally compatible with 2D diffusion architectures. These 2D-based approaches typically reduce 3D spatial extrapolation to 2D temporal extension, which introduces two fundamental issues: (i) representing 3D scenes via 2D views leads to significant representation redundancy, and (ii) a latent space rooted in 2D inherently limits the spatial consistency of the generated 3D scenes.

In this paper, we propose, for the first time, to perform 3D scene generation directly within an implicit 3D latent space to address these limitations. First, we repurpose frozen 2D representation encoders to construct our 3D Representation Autoencoder (3DRAE), which grounds view-coupled 2D semantic representations into a view-decoupled 3D latent representation. This enables representing 3D scenes observed from arbitrary numbers of views—at any resolution and aspect ratio—with fixed complexity and rich semantics. Then we introduce 3D Diffusion Transformer (3DDiT), which performs diffusion modeling in this 3D latent space, achieving remarkably efficient and spatially consistent 3D scene generation while supporting diverse conditioning configurations.
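To make the encoder's key property concrete, the toy sketch below shows how a fixed set of latent queries can ground a variable number of view features into a fixed-length latent via a single cross-attention step. This is an illustrative assumption, not the authors' implementation: all names, shapes, and the single-head attention are placeholders.

```python
import numpy as np

# Hedged sketch (NOT the authors' code): fixed learnable queries
# cross-attend over all view tokens, so the latent shape does not
# depend on how many views observe the scene.
rng = np.random.default_rng(0)

N_LATENT, DIM = 1024, 64                      # assumed token count / width
queries = rng.normal(size=(N_LATENT, DIM))    # stand-in for learned queries

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode(view_features):
    """Cross-attend fixed queries over all view tokens -> fixed-length latent."""
    kv = view_features.reshape(-1, DIM)       # (n_views * tokens_per_view, DIM)
    attn = softmax(queries @ kv.T / np.sqrt(DIM))
    return attn @ kv                          # (N_LATENT, DIM), view count erased

# 16 views or 128 views -> identical latent complexity:
views_16 = rng.normal(size=(16, 196, DIM))    # 16 views, 196 patch tokens each
views_128 = rng.normal(size=(128, 196, DIM))
assert encode(views_16).shape == encode(views_128).shape == (N_LATENT, DIM)
```

The point of the sketch is only the shape invariance: however many view-coupled 2D tokens go in, a fixed-length, view-decoupled latent comes out, which is what lets 3DDiT operate at constant cost.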

Moreover, since our approach directly generates a 3D scene representation, it can be decoded into images and optional point maps along arbitrary camera trajectories without a per-trajectory diffusion sampling pass, which 2D-based approaches typically require.
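The cost structure this implies can be sketched as follows: one expensive diffusion sample yields a scene latent, after which each view along any trajectory is a cheap decoder call. The function names and shapes below are illustrative assumptions, not the released API.

```python
import numpy as np

# Hedged sketch (illustrative only): diffusion sampling runs ONCE per
# scene; rendering a new camera trajectory costs only decoder calls.
rng = np.random.default_rng(0)
N_LATENT, DIM = 1024, 64

def sample_scene_latent():
    """Stand-in for 3DDiT diffusion sampling (expensive, run once)."""
    return rng.normal(size=(N_LATENT, DIM))

def decode_view(latent, camera_pose):
    """Stand-in for the 3DRAE decoder: latent + pose -> (image, point map)."""
    image = np.tanh(latent.mean(axis=0) + camera_pose.sum())  # toy "render"
    point_map = np.outer(camera_pose, latent.std(axis=0))     # toy geometry
    return image, point_map

latent = sample_scene_latent()                       # one diffusion pass
trajectory = [rng.normal(size=6) for _ in range(30)] # 30 arbitrary poses
frames = [decode_view(latent, pose) for pose in trajectory]
print(len(frames))  # 30 decoded views from a single sampled latent
```

Contrast this with 2D-based pipelines, where each new trajectory would trigger another full diffusion sampling pass over the view sequence.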

1K tokens represent 3D scenes observed by any number of views

We can use 3DRAE to encode any number of views into fixed-length 3D latent tokens and decode them into images and point maps of arbitrary novel views.
Below we render point clouds and camera frustums for 3D scenes observed by different numbers (16 vs. 128) of input views.

Single-View Conditioned 3D Generation

Left: conditional view. Right: generated 3D scene with point cloud and camera frustum renderings. We highlight the camera frustum of the conditional view in red.

Multi-View Conditioned 3D Generation

Left: conditional views. Right: generated 3D scene with point cloud and camera frustum renderings. We highlight the camera frustums of the conditional views in red.

Unconditioned 3D Generation

Given a specific camera trajectory as the condition, we generate diverse 3D scenes by varying the random seed.

Zero-Shot Results on Tanks & Temples, Mip-NeRF 360, and DTU datasets

Left: conditional view. Right: generated 3D scene with point cloud and camera frustum renderings. We highlight the camera frustum of the conditional view in red.

Broader Impact

(1) We introduce a novel paradigm: repurposing 2D representation models to derive a 3D-grounded latent representation for 3D scenes. This 3D latent representation preserves the rich semantics inherent in 2D models while being grounded with 3D awareness. Crucially, such a compact 3D latent representation is substantially more efficient than representing 3D scenes as multi-view images or videos. By performing diffusion modeling directly within this 3D latent space at scale, we demonstrate efficient 3D scene generation with strong spatial consistency.

(2) Our 3D latent space, particularly when derived from SigLIP2, is naturally compatible with mainstream large multimodal models. This potentially enables the conversion of multi-view observed 3D scenes into a fixed-length set of 3D-aware, language-aligned tokens, facilitating 3D scene understanding without relying on image- or video-based large multimodal models as many existing approaches do.

(3) We open new possibilities toward building embodied agents that unify 3D worlds and language. Furthermore, our 3D latent space holds promise for unifying 3D scene understanding and generation within a shared representation, contributing to the development of physical world models.

BibTeX

@article{wei20263drae,
  author    = {Wei, Dongxu and Xu, Qi and Li, Zhiqi and Zhou, Hangning and Qiu, Cong and Qin, Hailong and Yang, Mu and Cui, Zhaopeng and Liu, Peidong},
  title     = {Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale},
  journal   = {arXiv},
  year      = {2026},
}