Context Parallelism for Scalable Million-Token Inference

4 November 2024
Amy Yang
Jingyi Yang
Aya Ibrahim
Xinfeng Xie
Bangsheng Tang
Grigory Sizov
Jeremy Reizenstein
Jongsoo Park
Jianyu Huang
Abstract

We present context parallelism for long-context large language model inference, which achieves near-linear scaling of long-context prefill latency on up to 128 H100 GPUs across 16 nodes. In particular, our method achieves 1M-token context prefill with the Llama3 405B model in 77s (93% parallelization efficiency, 63% FLOPS utilization) and 128K-token context prefill in 3.8s. We develop two lossless, exact ring attention variants, pass-KV and pass-Q, that cover a wide range of use cases with state-of-the-art performance: full prefill, persistent-KV prefill, and decode. Benchmarks on H100 GPU hosts interconnected with RDMA and with TCP show similar scalability for long-context prefill, demonstrating that our method scales well in common commercial data centers with medium-to-low inter-host bandwidth.
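
To make the pass-KV pattern concrete, below is a minimal single-process sketch (not the paper's implementation): query shards stay resident on each "rank" while KV shards hop around a ring, and each incoming KV block is folded into the local output with an online, flash-attention-style softmax, so the result is exactly equal to unsharded causal attention. All names (ring_pass_kv_attention, attend_block, n_ranks, shard sizes) are illustrative assumptions; a real deployment would overlap the KV transfer with compute over RDMA or TCP, which this sketch does not model.

# Sketch of pass-KV ring attention on one process, NumPy only (hypothetical names).
import numpy as np

def attend_block(q, k, v, scale, o, m, l, causal_diag=False):
    """Fold one KV block into the running output (o), row max (m), denominator (l)."""
    s = (q @ k.T) * scale                      # raw attention scores for this block
    if causal_diag:                            # diagonal block: mask future keys
        s = np.where(np.tril(np.ones(s.shape, bool)), s, -np.inf)
    m_new = np.maximum(m, s.max(axis=-1))      # updated running row max
    p = np.exp(s - m_new[:, None])             # probabilities rescaled to new max
    correction = np.exp(m - m_new)             # rescale previously accumulated terms
    l = l * correction + p.sum(axis=-1)
    o = o * correction[:, None] + p @ v
    return o, m_new, l

def ring_pass_kv_attention(q_shards, k_shards, v_shards):
    """Exact causal attention over token shards; KV shards rotate around the ring."""
    n_ranks = len(q_shards)
    scale = 1.0 / np.sqrt(q_shards[0].shape[-1])
    outs = []
    for r in range(n_ranks):                   # each "rank" runs this loop locally
        q = q_shards[r]
        o = np.zeros_like(q)
        m = np.full(q.shape[0], -np.inf)
        l = np.zeros(q.shape[0])
        for step in range(n_ranks):            # n_ranks ring hops of the KV shard
            src = (r - step) % n_ranks         # rank that produced this KV block
            if src > r:                        # block lies entirely in the future,
                continue                       # so the causal mask drops it wholesale
            o, m, l = attend_block(q, k_shards[src], v_shards[src], scale,
                                   o, m, l, causal_diag=(src == r))
        outs.append(o / l[:, None])            # final softmax normalization
    return np.concatenate(outs, axis=0)

# Tiny correctness check against unsharded causal attention.
rng = np.random.default_rng(0)
P, T, D = 4, 8, 16                             # 4 ranks, 8 tokens each, head dim 16
q = rng.normal(size=(P * T, D)); k = rng.normal(size=(P * T, D)); v = rng.normal(size=(P * T, D))
ring_out = ring_pass_kv_attention(np.split(q, P), np.split(k, P), np.split(v, P))
s = (q @ k.T) / np.sqrt(D)
s = np.where(np.tril(np.ones(s.shape, bool)), s, -np.inf)
ref = np.exp(s - s.max(-1, keepdims=True)); ref = (ref / ref.sum(-1, keepdims=True)) @ v
assert np.allclose(ring_out, ref, atol=1e-6)

The pass-Q variant described in the abstract is, presumably, the mirror image: query blocks rotate while KV stays resident, which the paper matches to the persistent-KV prefill and decode regimes; the trade-off details are in the paper, not this sketch.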

@article{yang2025_2411.01783,
  title={Context Parallelism for Scalable Million-Token Inference},
  author={Amy Yang and Jingyi Yang and Aya Ibrahim and Xinfeng Xie and Bangsheng Tang and Grigory Sizov and Jeremy Reizenstein and Jongsoo Park and Jianyu Huang},
  journal={arXiv preprint arXiv:2411.01783},
  year={2025}
}