3D scene generation has long been dominated by 2D multi-view or video diffusion models. This is due not only to the lack of a scene-level 3D latent representation, but also to the fact that most scene-level 3D visual data exists as multi-view images or videos, which are naturally compatible with 2D diffusion architectures. These 2D-based approaches typically reduce 3D spatial extrapolation to 2D temporal extension, which introduces two fundamental issues: (i) representing 3D scenes via 2D views incurs significant representation redundancy, and (ii) a latent space rooted in 2D inherently limits the spatial consistency of the generated 3D scenes.
In this paper, we propose, for the first time, to perform 3D scene generation directly within an implicit 3D latent space to address these limitations. First, we repurpose frozen 2D representation encoders to construct our 3D Representation Autoencoder (3DRAE), which grounds view-coupled 2D semantic representations into a view-decoupled 3D latent representation. This enables representing 3D scenes observed from arbitrary numbers of views, at any resolution and aspect ratio, with fixed complexity and rich semantics. We then introduce the 3D Diffusion Transformer (3DDiT), which performs diffusion modeling in this 3D latent space, achieving remarkably efficient and spatially consistent 3D scene generation while supporting diverse conditioning configurations.
Moreover, since our approach directly generates a 3D scene representation, it can be decoded into images and optional point maps along arbitrary camera trajectories without requiring a per-trajectory diffusion sampling pass, which is common in 2D-based approaches.
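The two properties above, a latent whose size is independent of the number of observed views, and a single generated latent that can be decoded for arbitrary novel cameras, can be illustrated with a toy numpy sketch. Everything here is an illustrative assumption, not the paper's architecture: the single cross-attention readout, the shapes (`D`, `K`, `P`), and the names `encode`/`decode` merely stand in for the 3DRAE encoder and decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64    # latent channel dimension (assumed for illustration)
K = 1024  # fixed number of 3D latent tokens ("1K tokens")
P = 16    # patch tokens per view (kept tiny for this toy example)

# Learned query tokens stand in for the view-decoupled 3D latent slots.
queries = rng.standard_normal((K, D)) / np.sqrt(D)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode(view_feats):
    """Pool per-view 2D features of shape (N*P, D) into K latent tokens
    with one cross-attention readout. The output shape is independent of
    the number of input views N."""
    attn = softmax(queries @ view_feats.T / np.sqrt(D))  # (K, N*P)
    return attn @ view_feats                             # (K, D)

def decode(latent, cam_queries):
    """Toy novel-view readout: per-pixel camera queries attend over the
    fixed-size latent (a stand-in for the image/point-map decoder)."""
    attn = softmax(cam_queries @ latent.T / np.sqrt(D))  # (P, K)
    return attn @ latent                                 # (P, D)

# Scenes observed from 16 vs. 128 views map to the same latent size.
z_16 = encode(rng.standard_normal((16 * P, D)))
z_128 = encode(rng.standard_normal((128 * P, D)))
print(z_16.shape, z_128.shape)  # (1024, 64) (1024, 64)

# One latent, many novel views: no per-trajectory sampling pass needed.
for _ in range(3):
    feats = decode(z_16, rng.standard_normal((P, D)))
    assert feats.shape == (P, D)
```

The loop at the end mirrors the amortization argument: a 2D-based pipeline would rerun diffusion sampling for each new camera trajectory, whereas here the (toy) latent is produced once and only the lightweight decoder runs per view.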
We can use 3DRAE to encode any number of views into fixed-length 3D latent tokens, and decode them into images and point maps for arbitrary novel views.
Below, we render point clouds and camera frustums for 3D scenes observed from different numbers of input views (16 vs. 128).
Left: conditional view. Right: generated 3D scene with point cloud and camera frustum renderings. The camera frustum of the conditional view is highlighted in red.
Left: conditional views. Right: generated 3D scene with point cloud and camera frustum renderings. The camera frustums of the conditional views are highlighted in red.
Given a specific camera trajectory as the condition, we can generate diverse 3D scenes by varying the random seed.
Left: conditional view. Right: generated 3D scene with point cloud and camera frustum renderings. The camera frustum of the conditional view is highlighted in red.
(1) We introduce a novel paradigm: repurposing 2D representation models to derive a 3D-grounded latent representation for 3D scenes. This latent representation preserves the rich semantics inherent in 2D models while being grounded with 3D awareness. Crucially, this compact 3D latent representation is substantially more efficient than representing 3D scenes as multi-view images or videos. By performing diffusion modeling directly within this 3D latent space at scale, we demonstrate efficient 3D scene generation with strong spatial consistency.
(2) Our 3D latent space, particularly when derived from SigLIP2, is naturally compatible with mainstream large multimodal models. Potentially, it enables the conversion of multi-view observations of 3D scenes into a fixed-length set of 3D-aware, language-aligned tokens, facilitating 3D scene understanding without relying on image- or video-based large multimodal models as many existing approaches do.
(3) We open new possibilities for building embodied agents that unify 3D worlds and language. Furthermore, our 3D latent space holds promise for unifying 3D scene understanding and generation within a shared representation, contributing to the development of physical world models.
@article{wei20263drae,
author = {Wei, Dongxu and Xu, Qi and Li, Zhiqi and Zhou, Hangning and Qiu, Cong and Qin, Hailong and Yang, Mu and Cui, Zhaopeng and Liu, Peidong},
title = {Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale},
journal = {arXiv},
year = {2026},
}