SceneGen

Single-Image 3D Scene Generation in One Feedforward Pass

School of Artificial Intelligence, Shanghai Jiao Tong University
*Equal contribution   †Corresponding author

Abstract

3D content generation has recently attracted significant research interest due to its applications in VR/AR and embodied AI. In this work, we address the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input and simultaneously produces multiple 3D assets with geometry and texture. Notably, SceneGen requires no per-scene optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from the visual and geometric encoders during feature extraction. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios: despite being trained solely on single-image inputs, our architectural design yields improved generation quality when multiple images are available; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robust generation abilities of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks.

🎨 Interactive Results Gallery

Explore SceneGen's 3D scene generation results. Click on any example below to view the detailed generation process and results.



[Gallery: input scene images (📸 Input Scene) paired with their generated 3D scenes (🎯 Generated 3D Scene)]

Overview

Our proposed SceneGen framework takes a single scene image and its corresponding object masks as inputs, and efficiently generates multiple 3D assets with coherent geometry, texture, and spatial arrangement in a single feedforward pass.
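To make this input/output contract concrete, the sketch below illustrates one plausible data layout: a single RGB scene image plus one binary mask per object, mapped by a single forward call to N textured assets and their relative positions. The tensor shapes and the commented-out scenegen(...) signature are illustrative assumptions, not the released API.

# Illustrative input layout for SceneGen-style inference (shapes are assumptions).
import torch

image = torch.rand(3, 512, 512)      # single RGB scene image, CHW
masks = torch.zeros(4, 512, 512)     # 4 objects -> 4 binary object masks
masks[0, 100:300, 150:350] = 1.0     # e.g. one object occupies this region

# A single feedforward call (hypothetical signature) would return, for each
# masked object, a textured 3D asset plus its position relative to the scene:
#   assets, positions = scenegen(image.unsqueeze(0), masks.unsqueeze(0))
#   assets:    list of 4 textured meshes
#   positions: (4, 3) translations relative to a scene anchor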

Architecture

Architecture Diagram

SceneGen takes a single scene image containing multiple objects, together with the corresponding segmentation masks, as input. A pre-trained local attention block first refines the texture of each asset. Our proposed global attention block then integrates asset-level and scene-level features extracted by dedicated visual and geometric encoders. Finally, two off-the-shelf structure decoders and our position head decode these latent features into multiple 3D assets with geometry, texture, and relative spatial positions.
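As a structural reference, the following PyTorch sketch mirrors the pipeline described above: per-asset local attention, a global attention block in which asset tokens cross-attend to scene-level features from stand-in visual and geometric encoders, and a position head that regresses each asset's relative translation. All dimensions and module choices are simplified assumptions for illustration; this is not the released implementation.

import torch
import torch.nn as nn

class SceneGenSketch(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.visual_enc = nn.Linear(dim, dim)    # stand-in visual encoder
        self.geom_enc = nn.Linear(dim, dim)      # stand-in geometric encoder
        # Local attention: refines each asset's own tokens independently.
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Global attention: asset tokens attend to shared scene context.
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.structure_dec = nn.Linear(dim, dim)  # stand-in structure decoder
        self.position_head = nn.Linear(dim, 3)    # relative translation per asset

    def forward(self, asset_tokens, scene_tokens):
        # asset_tokens: (N_assets, T, dim); scene_tokens: (1, S, dim)
        x, _ = self.local_attn(asset_tokens, asset_tokens, asset_tokens)
        ctx = self.visual_enc(scene_tokens) + self.geom_enc(scene_tokens)
        ctx = ctx.expand(x.shape[0], -1, -1)           # share scene context
        x, _ = self.global_attn(x, ctx, ctx)           # cross-attention
        latents = self.structure_dec(x)                # -> 3D structure latents
        positions = self.position_head(x.mean(dim=1))  # (N_assets, 3)
        return latents, positions

latents, positions = SceneGenSketch()(torch.rand(4, 16, 256), torch.rand(1, 64, 256))
print(latents.shape, positions.shape)  # (4, 16, 256) and (4, 3)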

Results

Our proposed SceneGen generates physically plausible 3D scenes with complete structures, detailed textures, and accurate spatial relationships, outperforming prior methods in both geometric accuracy and visual quality on synthetic and real-world datasets.

Quantitative Results Table

Quantitative Comparisons on the 3D-FUTURE Test Set. We evaluate geometric structure using scene-level Chamfer Distance (CD-S) and F-Score (F-Score-S), object-level Chamfer Distance (CD-O) and F-Score (F-Score-O), and the volumetric IoU of object bounding boxes (IoU-B). For visual quality, CLIP-S and DINO-S denote CLIP and DINOv2 image-to-image similarity, respectively. We report the time cost of generating a single asset on a single A100 GPU; an asterisk (*) indicates the use of MV-Adapter for texture rendering.
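For readers reproducing the geometry metrics, the snippet below gives minimal reference implementations of Chamfer Distance and F-Score between sampled point clouds in their standard form; the distance threshold tau and any normalization used in the paper's exact evaluation protocol are assumptions here.

import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer Distance between point sets p (N, 3) and q (M, 3)."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def f_score(p, q, tau=0.05):
    """F-Score at distance threshold tau (the threshold is an assumed value)."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    precision = (d.min(axis=1) < tau).mean()  # predicted points near ground truth
    recall = (d.min(axis=0) < tau).mean()     # ground-truth points near prediction
    return 2 * precision * recall / max(precision + recall, 1e-8)

pred = np.random.rand(1024, 3)  # points sampled from a generated asset
gt = np.random.rand(1024, 3)    # points sampled from the ground truth
print(chamfer_distance(pred, gt), f_score(pred, gt))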

Qualitative Comparison Figure

Qualitative Comparisons on the 3D-FUTURE Test Set and ScanNet++. SceneGen produces complete structures, detailed textures, and precise spatial relationships on both the synthetic and real-world datasets.

BibTeX

@article{meng2025scenegen,
  author    = {Meng, Yanxu and Wu, Haoning and Zhang, Ya and Xie, Weidi},
  title     = {SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass},
  journal   = {arXiv preprint arXiv:2508.15769},
  year      = {2025},
}