arXiv:2408.14158

Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning

26 August 2024
Wei An
Xiao Bi
Guanting Chen
Shanhuang Chen
Chengqi Deng
Honghui Ding
Kai Dong
Qiushi Du
Wenjun Gao
Kang Guan
Jianzhong Guo
Yongqiang Guo
Zhe Fu
Ying He
Panpan Huang
Jiashi Li
Wenfeng Liang
Xiaodong Liu
Xin Liu
Yiyuan Liu
Yuxuan Liu
Shanghao Lu
Xuan Lu
Xiaotao Nie
Tian Pei
Junjie Qiu
Hui Qu
Z. Z. Ren
Zhangli Sha
Xuecheng Su
Xiaowen Sun
Yixuan Tan
Minghui Tang
Shiyu Wang
Yaohui Wang
Yongji Wang
Ziwei Xie
Yiliang Xiong
Yanhong Xu
Shengfeng Ye
Shuiping Yu
Yukun Zha
Liyue Zhang
Haowei Zhang
Mingchuan Zhang
Wentao Zhang
Yichao Zhang
Chenggang Zhao
Yao Zhao
Shangyan Zhou
Shunfeng Zhou
Yuheng Zou
Abstract

The rapid progress in Deep Learning (DL) and Large Language Models (LLMs) has exponentially increased demands for computational power and bandwidth. This, combined with the high costs of faster computing chips and interconnects, has significantly inflated High Performance Computing (HPC) construction costs. To address these challenges, we introduce the Fire-Flyer AI-HPC architecture, a synergistic hardware-software co-design framework, and its best practices. For DL training, we deployed the Fire-Flyer 2 with 10,000 PCIe A100 GPUs, achieving performance approximating that of the DGX-A100 while reducing costs by half and energy consumption by 40%. We specifically engineered HFReduce to accelerate allreduce communication and implemented numerous measures to keep our Computation-Storage Integrated Network congestion-free. Through our software stack, including HaiScale, 3FS, and HAI-Platform, we achieved substantial scalability by overlapping computation and communication. Our system-oriented experience from DL training provides valuable insights to drive future advancements in AI-HPC.
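
For readers unfamiliar with the allreduce collective that HFReduce is said to accelerate, the following is a minimal sketch of its reference semantics only: every worker contributes a buffer and every worker receives the elementwise sum. It is not HFReduce's algorithm, which the abstract does not describe; the function name `allreduce_sum` and the in-process list-of-lists representation of workers are assumptions made purely for illustration.

```python
from typing import List


def allreduce_sum(worker_buffers: List[List[float]]) -> List[List[float]]:
    """Reference semantics of a sum-allreduce (hypothetical helper, not HFReduce).

    Each inner list stands in for one worker's gradient buffer. After the
    call, every worker holds the same elementwise sum of all buffers.
    """
    if not worker_buffers:
        return []
    length = len(worker_buffers[0])
    if any(len(buf) != length for buf in worker_buffers):
        raise ValueError("all workers must hold buffers of the same length")
    # Reduce step: elementwise sum across workers.
    total = [sum(values) for values in zip(*worker_buffers)]
    # Broadcast step: every worker gets its own copy of the result.
    return [list(total) for _ in worker_buffers]


# Three simulated workers, each with a two-element gradient.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = allreduce_sum(grads)
```

In data-parallel training this operation typically runs once per step on the gradients, which is why its latency matters at the scale of thousands of GPUs.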
