AAAI 2026
SceneGenesis: 3D Scene Synthesis via Semantic Structural Priors and Mesh-Guided Video-Geometry Fusion
Abstract
Generating high-quality, controllable, and structurally consistent 3D scenes in complex multi-object environments remains a fundamental challenge. We present SceneGenesis, a unified framework that synthesizes 3D scenes by combining semantic structural priors with mesh-guided video–geometry fusion. SceneGenesis first employs large language models to convert textual descriptions into category-aware object specifications, which are transformed into structured meshes using procedural approximations and pretrained asset generators, enabling precise layout control and scalable scene construction. To obtain rich and style-controllable appearances, SceneGenesis generates multi-view video representations conditioned on the initialized structure. A mesh-guided video–geometry fusion module then consolidates video evidence with mesh priors through mesh-conditioned fragment initialization, progressive geometric refinement, and structure-aware optimization, substantially improving global geometric fidelity and visual realism. Experiments demonstrate that SceneGenesis supports flexible style variation and object-level editing while achieving strong controllability, scalability, and structural quality.
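The abstract describes a pipeline: a language model turns a text prompt into category-aware object specifications, which procedural approximations convert into structured meshes before the video-conditioned appearance and fusion stages. The sketch below illustrates only the first two stages under toy assumptions; `parse_scene_description`, `ObjectSpec`, and the keyword lookup are hypothetical stand-ins for the paper's LLM and asset-generation components, not its actual implementation.

```python
from dataclasses import dataclass

@dataclass
class ObjectSpec:
    # Category-aware object specification: semantic class plus layout info.
    category: str
    position: tuple  # (x, y, z) scene placement
    size: tuple      # (width, height, depth)

def parse_scene_description(text):
    # Hypothetical stand-in for the LLM stage: map a text prompt to
    # object specifications. A real system would prompt an LLM; this
    # toy version uses a fixed keyword lookup.
    known = {"sofa": ((0.0, 0.0, 0.0), (2.0, 0.9, 1.0)),
             "table": ((1.5, 0.0, 0.0), (1.2, 0.5, 0.8))}
    return [ObjectSpec(word, pos, size)
            for word, (pos, size) in known.items()
            if word in text.lower()]

def specs_to_meshes(specs):
    # Procedural approximation: each spec becomes an axis-aligned box
    # (8 corner vertices), standing in for pretrained asset generators.
    meshes = []
    for s in specs:
        cx, cy, cz = s.position
        sx, sy, sz = s.size
        verts = [(cx + dx * sx / 2, cy + dy * sy / 2, cz + dz * sz / 2)
                 for dx in (-1, 1) for dy in (-1, 1) for dz in (-1, 1)]
        meshes.append({"category": s.category, "vertices": verts})
    return meshes

specs = parse_scene_description("A living room with a sofa and a table")
meshes = specs_to_meshes(specs)
print([m["category"] for m in meshes])   # → ['sofa', 'table']
print(len(meshes[0]["vertices"]))        # → 8
```

The structured meshes produced this way would then condition the multi-view video generation and the mesh-guided video-geometry fusion stages described in the abstract.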
Context
- Venue: AAAI Conference on Artificial Intelligence
- Paper id: 1105232034133140742