Training Video Foundation Models with NVIDIA NeMo
Zeeshan Patel, Ethan He, Parth Mannan, Xiaowei Ren, Ryan Wolf, Niket Agarwal, Jacob Huffman, Zhuoyao Wang, Carl Wang, Jack Chang, Yan Bai, Tommy Huang, Linnan Wang, Sahil Jain, Shanmugam Ramasamy, Joseph Jennings, Ekaterina Sirazitdinova, Oleg Sudakov, Mingyuan Ma, Bobby Chen, Forrest Lin, Hao Wang, Vasanth Rao Naik Sabavat, Sriharsha Niverty, Rong Ou, Pallab Bhattacharya, David Page, Nima Tajbakhsh, Ashwath Aithal

Abstract
Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.
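As a rough, concrete illustration of what one denoising training step for a video diffusion model involves, the minimal, self-contained PyTorch sketch below may help. It is not the NeMo API presented in this paper; the TinyVideoDenoiser module, the linear noising schedule, and the latent tensor shapes are hypothetical placeholders, and a parallel training framework such as NeMo would replicate and shard a step like this across many GPUs.

# Illustrative sketch only (not the NeMo API): one denoising training step
# for a toy video diffusion model in plain PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyVideoDenoiser(nn.Module):
    """Hypothetical placeholder denoiser over (B, C, T, H, W) video latents."""

    def __init__(self, channels: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(channels, 64, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(64, channels, kernel_size=3, padding=1),
        )

    def forward(self, x, t):
        # Inject the diffusion timestep as a simple broadcast bias
        # (placeholder for real timestep/text conditioning).
        return self.net(x + t.view(-1, 1, 1, 1, 1))


def training_step(model, latents, optimizer):
    """Denoising objective: corrupt the latents, then predict the added noise."""
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], device=latents.device)  # uniform timesteps
    noisy = latents + t.view(-1, 1, 1, 1, 1) * noise          # toy linear noising schedule
    loss = F.mse_loss(model(noisy, t), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    model = TinyVideoDenoiser()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    latents = torch.randn(2, 8, 4, 16, 16)  # (batch, channels, frames, height, width)
    print(training_step(model, latents, opt))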