
TrainVerify: Equivalence-Based Verification for Distributed LLM Training

19 June 2025
Yunchi Lu
Youshan Miao
Cheng Tan
Peng Huang
Yi Zhu
Xian Zhang
Fan Yang
Main: 12 pages, 12 figures, 8 tables; Appendix: 9 pages
Abstract

Training large language models (LLMs) at scale requires parallel execution across thousands of devices, incurring enormous computational costs. Yet these costly distributed training runs are rarely verified, leaving them prone to silent errors and potentially wasting millions of GPU hours. We introduce TrainVerify, a system for verifiable distributed training of LLMs. Given a deep learning model's logical specification as the ground truth, TrainVerify formally verifies that a distributed parallel execution plan is mathematically equivalent to it. Direct verification is notoriously difficult due to the sheer scale of LLMs, which often involve billions of variables and highly intricate computation graphs. TrainVerify therefore introduces shape-reduction techniques and a stage-wise parallel verification algorithm that significantly reduce complexity while preserving formal correctness. TrainVerify scales to frontier LLMs, including successful verification of the Llama3 (405B) and DeepSeek-V3 (671B) training plans.
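To make the equivalence question concrete, here is a minimal numeric sketch (not the paper's formal symbolic method) of what it means for a distributed execution plan to match a logical specification: a 2-way tensor-parallel matmul with column-sharded weights should reproduce the single-device matmul exactly. The tiny tensor dimensions mirror, in spirit, the shape-reduction idea of checking equivalence on reduced shapes; the shard count and shapes here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # activations (deliberately tiny shapes)
w = rng.standard_normal((8, 6))   # logical (unsharded) weight

# Logical specification: single-device matmul, the ground truth.
y_logical = x @ w

# Distributed plan: shard w column-wise across 2 "devices",
# compute partial outputs, then concatenate (an all-gather).
w_shards = np.split(w, 2, axis=1)
y_parts = [x @ ws for ws in w_shards]
y_parallel = np.concatenate(y_parts, axis=1)

# The plan is equivalent iff the gathered result matches the spec.
assert np.allclose(y_logical, y_parallel)
print("tensor-parallel plan matches logical spec")
```

TrainVerify performs this kind of check formally and symbolically over entire training computation graphs, rather than numerically on sampled inputs as in this sketch.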

@article{lu2025_2506.15961,
  title={TrainVerify: Equivalence-Based Verification for Distributed LLM Training},
  author={Yunchi Lu and Youshan Miao and Cheng Tan and Peng Huang and Yi Zhu and Xian Zhang and Fan Yang},
  journal={arXiv preprint arXiv:2506.15961},
  year={2025}
}