From Geometry to Culture: An Iterative VLM Layout Framework for Placing Objects in Complex 3D Scene Contexts

3D layout tasks have traditionally concentrated on geometric constraints, but many practical applications demand richer contextual understanding that spans social interactions, cultural traditions, and usage conventions. Existing methods often rely on rule-based heuristics or narrowly trained learning models, making them difficult to generalize and frequently prone to orientation errors that break realism. To address these challenges, we define four escalating context levels, ranging from straightforward physical placement to complex cultural requirements such as religious customs and advanced social norms. We then propose a Vision-Language Model-based pipeline that inserts minimal visual cues for orientation guidance and employs iterative feedback to pinpoint, diagnose, and correct unnatural placements in an automated fashion. Each adjustment is revisited through the system's verification process until it achieves a coherent result, thereby eliminating the need for extensive user oversight or manual parameter tuning. Our experiments across these four context levels reveal marked improvements in rotation accuracy, distance control, and overall layout plausibility compared with native VLM. By reducing the dependence on pre-programmed constraints or prohibitively large training sets, our method enables fully automated scene composition for both everyday scenarios and specialized cultural tasks, moving toward a universally adaptable framework for 3D arrangement.
View on arXiv@article{asano2025_2503.23707, title={ From Geometry to Culture: An Iterative VLM Layout Framework for Placing Objects in Complex 3D Scene Contexts }, author={ Yuto Asano and Naruya Kondo and Tatsuki Fushimi and Yoichi Ochiai }, journal={arXiv preprint arXiv:2503.23707}, year={ 2025 } }