3D content generation has recently attracted significant research interest due to its applications in VR/AR and embodied AI. In this work, we address the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input and simultaneously produces multiple 3D assets with geometry and texture. Notably, SceneGen requires neither optimization nor asset retrieval; (ii) we introduce a feature aggregation module within the feature extraction stage that integrates local and global scene information from visual and geometric encoders. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architectural design enables improved generation performance with multi-image inputs; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robust generation abilities of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks.
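To make the task setup concrete, the sketch below illustrates the intended input/output contract: one scene image plus per-object masks in, multiple posed 3D assets out in a single feedforward pass. The package, class, and method names (scenegen, SceneGen, from_pretrained, generate, export_glb) are hypothetical placeholders for illustration, not the released API.

import numpy as np
from PIL import Image

from scenegen import SceneGen  # hypothetical package/class name, for illustration only

# Load a hypothetical pretrained model (checkpoint name assumed).
model = SceneGen.from_pretrained("scenegen-base")

image = Image.open("scene.png").convert("RGB")          # single scene image
masks = [np.load(f"mask_{i}.npy") for i in range(3)]    # one binary mask per object

# Single feedforward pass: no per-scene optimization, no asset retrieval.
assets, positions = model.generate(image, masks)

for i, (asset, pos) in enumerate(zip(assets, positions)):
    asset.export_glb(f"asset_{i}.glb")  # each asset carries geometry and texture
    print(i, pos)                       # relative spatial position in the scene frame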
Explore SceneGen's 3D scene generation results. Click on any example below to view the detailed generation process and results.
To interact with the 3D models, click and drag to rotate the view, scroll (or pinch) to zoom in and out, use the "🔄 Reset View" button to return to the default orientation, click "🔲 Wireframe" to toggle the wireframe overlay, and press "📷 Screenshot" to download the current view as an image.

Our proposed SceneGen framework takes a single scene image and its corresponding object masks as inputs, and efficiently generates multiple 3D assets with coherent geometry, texture, and spatial arrangement in a single feedforward pass.
SceneGen takes a single scene image with multiple objects and corresponding segmentation masks as input. A pre-trained local attention block first refines the texture of each asset. Then, our introduced global attention block integrates asset-level and scene-level features extracted by dedicated visual and geometric encoders. Finally, two off-the-shelf structure decoders and our position head decode these latent features into multiple 3D assets with geometry, texture, and relative spatial positions.
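The data flow described above can be summarized in pseudocode. The sketch below is illustrative only: the module names (local_attn, global_attn, visual_encoder, geometric_encoder, shape_decoder, texture_decoder, position_head) mirror the description rather than the released implementation.

import torch.nn as nn

class SceneGenForward(nn.Module):
    """Illustrative forward pass mirroring the pipeline description (names assumed)."""
    def __init__(self, local_attn, global_attn, visual_encoder, geometric_encoder,
                 shape_decoder, texture_decoder, position_head):
        super().__init__()
        self.local_attn = local_attn                # pre-trained per-asset refinement block
        self.global_attn = global_attn              # introduced feature aggregation block
        self.visual_encoder = visual_encoder        # scene-level visual features
        self.geometric_encoder = geometric_encoder  # scene-level geometric features
        self.shape_decoder = shape_decoder          # off-the-shelf structure decoder (geometry)
        self.texture_decoder = texture_decoder      # off-the-shelf structure decoder (texture)
        self.position_head = position_head          # predicts relative spatial positions

    def forward(self, image, masks, asset_latents):
        # 1) Refine each asset's latent with the pre-trained local attention block.
        asset_latents = self.local_attn(asset_latents)

        # 2) Aggregate asset-level latents with scene-level visual and geometric context.
        visual_feats = self.visual_encoder(image, masks)
        geo_feats = self.geometric_encoder(image)
        fused = self.global_attn(asset_latents, visual_feats, geo_feats)

        # 3) Decode geometry, texture, and relative positions for all assets in one pass.
        geometry = self.shape_decoder(fused)
        texture = self.texture_decoder(fused)
        positions = self.position_head(fused)
        return geometry, texture, positions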
Quantitative Comparisons on the 3D-FUTURE Test Set. We evaluate geometric structure using scene-level Chamfer Distance (CD-S) and F-Score (F-Score-S), object-level Chamfer Distance (CD-O) and F-Score (F-Score-O), and the volumetric IoU of object bounding boxes (IoU-B). For visual quality, CLIP-S and DINO-S denote CLIP and DINOv2 image-to-image similarity, respectively. We report the time cost of generating a single asset on a single A100 GPU; * indicates that MV-Adapter is adopted for texture rendering.
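For reference, Chamfer Distance and F-Score between two point clouds can be computed as below. This is a generic implementation of the standard definitions, assuming a unidirectional distance threshold; it is not the exact evaluation code or threshold used for the reported numbers.

import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_fscore(pred, gt, threshold=0.1):
    """Symmetric Chamfer Distance and F-Score between two (N, 3) point clouds.

    `threshold` is the distance under which a point counts as correctly
    reconstructed; the value here is an assumed placeholder.
    """
    d_pred_to_gt, _ = cKDTree(gt).query(pred)   # nearest-neighbor distances, pred -> gt
    d_gt_to_pred, _ = cKDTree(pred).query(gt)   # nearest-neighbor distances, gt -> pred

    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()

    precision = (d_pred_to_gt < threshold).mean()  # fraction of predicted points near GT
    recall = (d_gt_to_pred < threshold).mean()     # fraction of GT points near prediction
    fscore = 2 * precision * recall / (precision + recall + 1e-8)
    return chamfer, fscore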
Qualitative Comparisons on the 3D-FUTURE Test Set and ScanNet++. Our proposed SceneGen generates physically plausible 3D scenes featuring complete structures, detailed textures, and precise spatial relationships, demonstrating superior geometric accuracy and visual quality over prior methods on both synthetic and real-world datasets.
@article{meng2025scenegen,
author = {Meng, Yanxu and Wu, Haoning and Zhang, Ya and Xie, Weidi},
title = {SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass},
journal = {arXiv preprint arXiv:2508.15769},
year = {2025},
}