We introduce a novel approach that takes a single semantic mask as input and synthesizes multi-view consistent color images of natural scenes, trained on a collection of single images from the Internet. Prior works on 3D-aware image synthesis either require multi-view supervision or learn category-level priors for specific classes of objects, neither of which is well suited to natural scenes. Our key idea for solving this challenging problem is to use a semantic field as the intermediate representation: it is easier to reconstruct from an input semantic mask, and it can then be translated into a radiance field with the assistance of off-the-shelf semantic image synthesis models. Experiments show that our method outperforms baseline methods and produces photorealistic, multi-view consistent videos of a variety of natural scenes.
@article{zhang2025_2302.07224,
  title={Painting 3D Nature in 2D: View Synthesis of Natural Scenes from a Single Semantic Mask},
  author={Shangzan Zhang and Sida Peng and Tianrun Chen and Linzhan Mou and Haotong Lin and Kaicheng Yu and Yiyi Liao and Xiaowei Zhou},
  journal={arXiv preprint arXiv:2302.07224},
  year={2025}
}