3D content generation has recently attracted significant research interest due to its applications in VR/AR and embodied AI. In this work, we address the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input and simultaneously produces multiple 3D assets with geometry and texture. Notably, SceneGen requires neither optimization nor asset retrieval; (ii) we introduce a feature aggregation module within the feature extraction stage that integrates local and global scene information from visual and geometric encoders. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architectural design enables improved generation performance with multi-image inputs; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robust generation abilities of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks.
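To make the task setup concrete, the sketch below illustrates the intended input/output contract: one scene image plus per-object masks in, multiple posed 3D assets out in a single feedforward pass. The package, class, and method names (scenegen, SceneGen, from_pretrained, generate, export_glb) are hypothetical placeholders for illustration, not the released API.

import numpy as np
from PIL import Image

from scenegen import SceneGen  # hypothetical package/class name, for illustration only

# Load a hypothetical pretrained model (checkpoint name assumed).
model = SceneGen.from_pretrained("scenegen-base")

image = Image.open("scene.png").convert("RGB")          # single scene image
masks = [np.load(f"mask_{i}.npy") for i in range(3)]    # one binary mask per object

# Single feedforward pass: no per-scene optimization, no asset retrieval.
assets, positions = model.generate(image, masks)

for i, (asset, pos) in enumerate(zip(assets, positions)):
    asset.export_glb(f"asset_{i}.glb")  # each asset carries geometry and texture
    print(i, pos)                       # relative spatial position in the scene frame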
Explore SceneGen's 3D scene generation results. Click on any example below to view the detailed generation process and results.
To interact with the 3D models, click and drag to rotate the view, scroll (or pinch) to zoom in and out, use the "🔄 Reset View" button to return to the default orientation, click "🔲 Wireframe" to toggle the wireframe overlay, and press "📷 Screenshot" to download the current view as an image.

Our proposed SceneGen framework takes a single scene image and its corresponding object masks as inputs, and efficiently generates multiple 3D assets with coherent geometry, texture, and spatial arrangement in a single feedforward pass.
SceneGen takes a single scene image with multiple objects and corresponding segmentation masks as input. A pre-trained local attention block first refines the texture of each asset. Then, our introduced global attention block integrates asset-level and scene-level features extracted by dedicated visual and geometric encoders. Finally, two off-the-shelf structure decoders and our position head decode these latent features into multiple 3D assets with geometry, texture, and relative spatial positions.
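The data flow described above can be summarized in pseudocode. The sketch below is illustrative only: the module names (local_attn, global_attn, visual_encoder, geometric_encoder, shape_decoder, texture_decoder, position_head) mirror the description rather than the released implementation.

import torch.nn as nn

class SceneGenForward(nn.Module):
    """Illustrative forward pass mirroring the pipeline description (names assumed)."""
    def __init__(self, local_attn, global_attn, visual_encoder, geometric_encoder,
                 shape_decoder, texture_decoder, position_head):
        super().__init__()
        self.local_attn = local_attn                # pre-trained per-asset refinement block
        self.global_attn = global_attn              # introduced feature aggregation block
        self.visual_encoder = visual_encoder        # scene-level visual features
        self.geometric_encoder = geometric_encoder  # scene-level geometric features
        self.shape_decoder = shape_decoder          # off-the-shelf structure decoder (geometry)
        self.texture_decoder = texture_decoder      # off-the-shelf structure decoder (texture)
        self.position_head = position_head          # predicts relative spatial positions

    def forward(self, image, masks, asset_latents):
        # 1) Refine each asset's latent with the pre-trained local attention block.
        asset_latents = self.local_attn(asset_latents)

        # 2) Aggregate asset-level latents with scene-level visual and geometric context.
        visual_feats = self.visual_encoder(image, masks)
        geo_feats = self.geometric_encoder(image)
        fused = self.global_attn(asset_latents, visual_feats, geo_feats)

        # 3) Decode geometry, texture, and relative positions for all assets in one pass.
        geometry = self.shape_decoder(fused)
        texture = self.texture_decoder(fused)
        positions = self.position_head(fused)
        return geometry, texture, positions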
Quantitative Comparisons on the 3D-FUTURE Test Set. We evaluate geometric structure using scene-level Chamfer Distance (CD-S) and F-Score (F-Score-S), object-level Chamfer Distance (CD-O) and F-Score (F-Score-O), and the volumetric IoU of object bounding boxes (IoU-B). For visual quality, CLIP-S and DINO-S denote CLIP and DINOv2 image-to-image similarity, respectively. We report the time cost of generating a single asset on a single A100 GPU; * indicates that MV-Adapter is adopted for texture rendering.
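For reference, Chamfer Distance and F-Score between two point clouds can be computed as below. This is a generic implementation of the standard definitions, assuming a unidirectional distance threshold; it is not the exact evaluation code or threshold used for the reported numbers.

import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_fscore(pred, gt, threshold=0.1):
    """Symmetric Chamfer Distance and F-Score between two (N, 3) point clouds.

    `threshold` is the distance under which a point counts as correctly
    reconstructed; the value here is an assumed placeholder.
    """
    d_pred_to_gt, _ = cKDTree(gt).query(pred)   # nearest-neighbor distances, pred -> gt
    d_gt_to_pred, _ = cKDTree(pred).query(gt)   # nearest-neighbor distances, gt -> pred

    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()

    precision = (d_pred_to_gt < threshold).mean()  # fraction of predicted points near GT
    recall = (d_gt_to_pred < threshold).mean()     # fraction of GT points near prediction
    fscore = 2 * precision * recall / (precision + recall + 1e-8)
    return chamfer, fscore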
Qualitative Comparisons on the 3D-FUTURE Test Set and ScanNet++. Our proposed SceneGen generates physically plausible 3D scenes featuring complete structures, detailed textures, and precise spatial relationships, demonstrating superior geometric accuracy and visual quality over prior methods on both synthetic and real-world datasets.
@article{meng2025scenegen,
author = {Meng, Yanxu and Wu, Haoning and Zhang, Ya and Xie, Weidi},
title = {SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass},
journal = {arXiv preprint arXiv:2508.15769},
year = {2025},
}