On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices

31 March 2025

Abstract

We present On-device Sora, the first training-free solution for diffusion-based on-device text-to-video generation that operates efficiently on smartphone-grade devices. To address the challenges of running diffusion-based text-to-video generation on computation- and memory-limited mobile devices, On-device Sora applies three novel techniques to pre-trained video generative models. First, Linear Proportional Leap (LPL) reduces the excessive number of denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) reduces the intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Third, Concurrent Inference with Dynamic Loading (CI-DL) dynamically partitions large models into smaller blocks and loads them into memory for concurrent model inference, addressing the limited memory of the device. We implement On-device Sora on the iPhone 15 Pro, and the experimental evaluations show that it generates high-quality videos on the device, comparable to those produced on high-end GPUs. These results show that On-device Sora enables efficient and high-quality video generation on resource-constrained mobile devices. We envision On-device Sora as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation on commodity mobile and embedded devices without resource-intensive re-training for model optimization (compression). The code implementation is available in a GitHub repository (this https URL).
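As a rough illustration of the idea behind TDTM, the sketch below merges consecutive frames of tokens along the temporal dimension before an attention layer and restores the original length afterwards. The abstract does not specify the merge operator or the group size, so averaging pairs of frames and restoring them by duplication are assumptions made for this example, and the names `merge_temporal` and `unmerge_temporal` are hypothetical, not taken from the authors' code.

```python
from typing import List

# One frame holds N tokens, each a D-dimensional vector.
Frame = List[List[float]]


def merge_temporal(frames: List[Frame]) -> List[Frame]:
    """Average each pair of consecutive frames, halving the temporal length.

    Assumption: pairwise averaging; the abstract only says tokens are merged.
    """
    if len(frames) % 2 != 0:
        raise ValueError("this sketch assumes an even number of frames")
    merged = []
    for t in range(0, len(frames), 2):
        first, second = frames[t], frames[t + 1]
        merged.append([
            [(a + b) / 2.0 for a, b in zip(tok_a, tok_b)]
            for tok_a, tok_b in zip(first, second)
        ])
    return merged


def unmerge_temporal(frames: List[Frame]) -> List[Frame]:
    """Restore the original temporal length by repeating each merged frame.

    Assumption: duplication is used here purely for illustration.
    """
    restored = []
    for frame in frames:
        restored.append([list(tok) for tok in frame])
        restored.append([list(tok) for tok in frame])
    return restored


# Toy input: 4 frames, 1 token per frame, 2 dimensions per token.
video_tokens = [
    [[0.0, 2.0]],
    [[2.0, 4.0]],
    [[10.0, 10.0]],
    [[20.0, 30.0]],
]

merged = merge_temporal(video_tokens)   # 2 frames reach the attention layer
restored = unmerge_temporal(merged)     # back to 4 frames afterwards
```

In this toy setup the attention layer would see half as many tokens along the temporal axis, which is where the computational saving described in the abstract would come from.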


@article{kim2025_2503.23796,
  title={On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices},
  author={Bosung Kim and Kyuhwan Lee and Isu Jeong and Jungmin Cheon and Yeojin Lee and Seulki Lee},
  journal={arXiv preprint arXiv:2503.23796},
  year={2025}
}
