3D scene generation has long been dominated by 2D multi-view or video diffusion models. This is due not only to the lack of a scene-level 3D latent representation, but also to the fact that most scene-level 3D visual data exists as multi-view images or videos, which are naturally compatible with 2D diffusion architectures. These 2D-based approaches typically reduce 3D spatial extrapolation to 2D temporal extension, which introduces two fundamental issues: (i) representing 3D scenes via 2D views incurs significant representation redundancy, and (ii) a latent space rooted in 2D inherently limits the spatial consistency of the generated 3D scenes.
In this paper, we propose, for the first time, to perform 3D scene generation directly within an implicit 3D latent space to address these limitations. First, we repurpose frozen 2D representation encoders to construct our 3D Representation Autoencoder (3DRAE), which grounds view-coupled 2D semantic representations into a view-decoupled 3D latent representation. This enables representing 3D scenes observed from arbitrary numbers of views, at any resolution and aspect ratio, with fixed complexity and rich semantics. We then introduce the 3D Diffusion Transformer (3DDiT), which performs diffusion modeling in this 3D latent space, achieving remarkably efficient and spatially consistent 3D scene generation while supporting diverse conditioning configurations.
Moreover, since our approach directly generates a 3D scene representation, it can be decoded into images and optional point maps along arbitrary camera trajectories without requiring a per-trajectory diffusion sampling pass, which is common in 2D-based approaches.
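The two properties above, a latent whose size is independent of the number of observed views, and a single generated latent that can be decoded for arbitrary novel cameras, can be illustrated with a toy numpy sketch. Everything here is an illustrative assumption, not the paper's architecture: the single cross-attention readout, the shapes (`D`, `K`, `P`), and the names `encode`/`decode` merely stand in for the 3DRAE encoder and decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64    # latent channel dimension (assumed for illustration)
K = 1024  # fixed number of 3D latent tokens ("1K tokens")
P = 16    # patch tokens per view (kept tiny for this toy example)

# Learned query tokens stand in for the view-decoupled 3D latent slots.
queries = rng.standard_normal((K, D)) / np.sqrt(D)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode(view_feats):
    """Pool per-view 2D features of shape (N*P, D) into K latent tokens
    with one cross-attention readout. The output shape is independent of
    the number of input views N."""
    attn = softmax(queries @ view_feats.T / np.sqrt(D))  # (K, N*P)
    return attn @ view_feats                             # (K, D)

def decode(latent, cam_queries):
    """Toy novel-view readout: per-pixel camera queries attend over the
    fixed-size latent (a stand-in for the image/point-map decoder)."""
    attn = softmax(cam_queries @ latent.T / np.sqrt(D))  # (P, K)
    return attn @ latent                                 # (P, D)

# Scenes observed from 16 vs. 128 views map to the same latent size.
z_16 = encode(rng.standard_normal((16 * P, D)))
z_128 = encode(rng.standard_normal((128 * P, D)))
print(z_16.shape, z_128.shape)  # (1024, 64) (1024, 64)

# One latent, many novel views: no per-trajectory sampling pass needed.
for _ in range(3):
    feats = decode(z_16, rng.standard_normal((P, D)))
    assert feats.shape == (P, D)
```

The loop at the end mirrors the amortization argument: a 2D-based pipeline would rerun diffusion sampling for each new camera trajectory, whereas here the (toy) latent is produced once and only the lightweight decoder runs per view.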
We can use 3DRAE to encode any number of views into fixed-length 3D latent tokens, and decode them into images and point maps for arbitrary novel views.
Below, we render point clouds and camera frustums for 3D scenes observed from different numbers of input views (16 vs. 128).
Left: conditional view. Right: generated 3D scene with point cloud and camera frustum renderings. The camera frustum of the conditional view is highlighted in red.
Left: conditional views. Right: generated 3D scene with point cloud and camera frustum renderings. The camera frustums of the conditional views are highlighted in red.
Given a specific camera trajectory as the condition, we can generate diverse 3D scenes by varying the random seed.
Left: conditional view. Right: generated 3D scene with point cloud and camera frustum renderings. The camera frustum of the conditional view is highlighted in red.
(1) We introduce a novel paradigm: repurposing 2D representation models to derive a 3D-grounded latent representation for 3D scenes. This latent representation preserves the rich semantics inherent in 2D models while being grounded with 3D awareness. Crucially, this compact 3D latent representation is substantially more efficient than representing 3D scenes as multi-view images or videos. By performing diffusion modeling directly within this 3D latent space at scale, we demonstrate efficient 3D scene generation with strong spatial consistency.
(2) Our 3D latent space, particularly when derived from SigLIP2, is naturally compatible with mainstream large multimodal models. Potentially, it enables the conversion of multi-view observations of 3D scenes into a fixed-length set of 3D-aware, language-aligned tokens, facilitating 3D scene understanding without relying on image- or video-based large multimodal models as many existing approaches do.
(3) We open new possibilities for building embodied agents that unify 3D worlds and language. Furthermore, our 3D latent space holds promise for unifying 3D scene understanding and generation within a shared representation, contributing to the development of physical world models.
@article{wei20263drae,
author = {Wei, Dongxu and Xu, Qi and Li, Zhiqi and Zhou, Hangning and Qiu, Cong and Qin, Hailong and Yang, Mu and Cui, Zhaopeng and Liu, Peidong},
title = {Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale},
journal = {arXiv},
year = {2026},
}