From Geometric Mimicry to Comprehensive Generation: A Context-Informed Multimodal Diffusion Model for Urban Morphology Synthesis

19 pages (main) + 5 pages (bibliography) + 3 pages (appendix); 12 figures, 6 tables
Abstract

Urban morphology is fundamental to determining urban functionality and vitality. Prevailing simulation methods, however, often oversimplify morphological generation as a purely geometric problem, neglecting the fusion of urban semantics and geographical context. To address this limitation, this study proposes ControlCity, a diffusion model that achieves comprehensive urban morphology generation through multimodal information fusion. We first constructed a quadruple "image-text-metadata-building footprint" dataset from 22 cities worldwide, which ControlCity uses as control conditions. Specifically, an enhanced ControlNet encodes image-based spatial constraints, while text and metadata provide semantic guidance and geographical context to jointly steer the generation. Experimental results demonstrate that, compared to the unimodal baseline, this method achieves significant gains in morphological fidelity: FID (lower scores indicate less visual error) was reduced by 71.01% to 50.94, and MIoU (higher scores indicate greater spatial overlap) improved by 38.46% to 0.36. Furthermore, the model demonstrates robust knowledge generalization and controllability, enabling cross-city style transfer and zero-shot generation for unseen cities. Ablation studies reveal the distinct roles of images, text, and metadata in the generation process. This study confirms that multimodal fusion is crucial for achieving the transition from "geometric mimicry" to "comprehensive generation," providing a novel paradigm for urban morphology research and applications.
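The MIoU metric quoted in the abstract measures spatial overlap between generated and ground-truth building-footprint masks. As a point of reference, a minimal sketch of mean IoU for binary masks is shown below; the function name `miou` and the two-class (building/background) setup are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def miou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean IoU over two classes (background=0, building=1) of binary masks.

    This is a generic sketch of the metric; the paper's exact evaluation
    pipeline (resolution, class handling) may differ.
    """
    ious = []
    for cls in (0, 1):
        p, g = pred == cls, gt == cls
        union = np.logical_or(p, g).sum()
        if union > 0:  # skip classes absent from both masks
            inter = np.logical_and(p, g).sum()
            ious.append(inter / union)
    return float(np.mean(ious))

# Example: a 2x2 predicted mask vs. ground truth.
pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 0]])
score = miou(pred, gt)  # IoU(bg)=2/3, IoU(building)=1/2 -> mean = 7/12
```

An MIoU of 0.36, as reported, means that on average barely over a third of the predicted and true footprint regions overlap relative to their union, which is why the 38.46% relative gain over the unimodal baseline is notable.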
