Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control

We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. In the design, the spatial conditional scheme is adaptive and customizable. It allows weighting different conditional inputs differently at different spatial locations. This enables highly controllable world generation and finds use in various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack. To help accelerate research development in the field, we open-source our models and code at this https URL.
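The idea of weighting different conditional inputs differently at different spatial locations can be illustrated with a small sketch. This is not the released Cosmos-Transfer1 code: the function name `fuse_controls`, the array shapes, and the rule of rescaling weights only where they sum to more than 1 are all assumptions made for illustration. Each modality (for example depth or edge) contributes a feature map, and a per-pixel weight map decides how strongly that modality influences each location.

```python
import numpy as np


def fuse_controls(features, weights):
    """Blend per-modality feature maps with per-pixel weight maps.

    Hypothetical helper, not part of any released API.
    features: dict mapping modality name -> array of shape (H, W, C)
    weights:  dict mapping modality name -> array of shape (H, W), values >= 0
    Returns an array of shape (H, W, C).
    """
    names = sorted(features)
    f = np.stack([features[n] for n in names]).astype(float)  # (N, H, W, C)
    w = np.stack([weights[n] for n in names]).astype(float)   # (N, H, W)

    # Assumption: rescale weights only at locations where they sum to more
    # than 1, so that no location receives a combined weight above 1.
    total = w.sum(axis=0, keepdims=True)                      # (1, H, W)
    w = w / np.maximum(total, 1.0)

    # Weighted sum over modalities, broadcasting weights across channels.
    return (w[..., None] * f).sum(axis=0)


# Toy 2x2 example with one channel: depth features are all 1, edge features all 3.
features = {
    "depth": np.full((2, 2, 1), 1.0),
    "edge": np.full((2, 2, 1), 3.0),
}
weights = {
    "depth": np.array([[1.0, 0.0], [0.5, 1.0]]),
    "edge": np.array([[0.0, 1.0], [0.5, 1.0]]),
}
fused = fuse_controls(features, weights)
print(fused[..., 0])  # [[1. 3.] [2. 2.]]
```

In the toy example, the top-left pixel follows depth only, the top-right pixel follows edge only, and the bottom row blends both equally. The bottom-right weights sum to 2, so they are rescaled to 0.5 each before blending.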
@article{nvidia2025_2503.14492,
  title={Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control},
  author={NVIDIA and Hassan Abu Alhaija and Jose Alvarez and Maciej Bala and Tiffany Cai and Tianshi Cao and Liz Cha and Joshua Chen and Mike Chen and Francesco Ferroni and Sanja Fidler and Dieter Fox and Yunhao Ge and Jinwei Gu and Ali Hassani and Michael Isaev and Pooya Jannaty and Shiyi Lan and Tobias Lasser and Huan Ling and Ming-Yu Liu and Xian Liu and Yifan Lu and Alice Luo and Qianli Ma and Hanzi Mao and Fabio Ramos and Xuanchi Ren and Tianchang Shen and Xinglong Sun and Shitao Tang and Ting-Chun Wang and Jay Wu and Jiashu Xu and Stella Xu and Kevin Xie and Yuchong Ye and Xiaodong Yang and Xiaohui Zeng and Yu Zeng},
  journal={arXiv preprint arXiv:2503.14492},
  year={2025}
}