
Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design

21 November 2025
Quentin G. Anthony
Yury Tokpanov
Skyler Szot
Srivatsan Rajagopal
Praneeth Medepalli
Rishi Iyer
Vasu Shyam
Anna Golubeva
Ansh Chaurasia
Xiao Yang
Tomás Figliolia
Robert Washbourne
Drew Thorstensen
Amartey Pearson
Zack Grossbart
Jason Van Patten
Emad Barsoum
Zhenyu Gu
Yao Fu
Beren Millidge
Topics: MoE · VLM · LRM
Main: 22 pages · Bibliography: 3 pages · Appendix: 3 pages · 17 figures · 6 tables
Abstract

We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing both MI300X GPUs and Pollara networking, and distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts over Pollara, which, to our knowledge, is the first such characterization at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks, and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-ignored utilities such as fault tolerance and checkpoint reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model, ZAYA1 (an MoE with 760M active and 8.3B total parameters, available at this https URL), which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining.
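
The systems-side characterization described above centers on collective microbenchmarks swept over message sizes and GPU counts. As a rough illustration of that kind of measurement, below is a minimal sketch of an all-reduce sweep using PyTorch distributed (RCCL on ROCm is exposed through the "nccl" backend). The function name, message sizes, and timing loop are illustrative assumptions, not the authors' actual benchmark harness.

```python
import os
import time
import torch
import torch.distributed as dist

def benchmark_all_reduce(sizes_mib, warmup=5, iters=20):
    """Time all-reduce over a sweep of message sizes and print per-size latency/bandwidth."""
    dist.init_process_group(backend="nccl")  # RCCL on ROCm is exposed via the "nccl" backend
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    rank, world = dist.get_rank(), dist.get_world_size()

    for size_mib in sizes_mib:
        numel = size_mib * 1024 * 1024 // 2  # fp16 elements for the target message size
        buf = torch.ones(numel, dtype=torch.float16, device="cuda")

        for _ in range(warmup):  # warm-up iterations to exclude one-time setup costs
            dist.all_reduce(buf)
        torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(buf)
        torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) / iters

        if rank == 0:
            # Ring all-reduce bus bandwidth estimate: 2*(N-1)/N * message_size / time
            bus_bw = 2 * (world - 1) / world * (size_mib / 1024) / elapsed
            print(f"{size_mib:5d} MiB  {elapsed * 1e3:8.3f} ms  ~{bus_bw:6.1f} GiB/s")

    dist.destroy_process_group()

if __name__ == "__main__":
    # Example sweep; launch with: torchrun --nproc_per_node=8 allreduce_bench.py
    benchmark_all_reduce([1, 8, 64, 256, 1024])
```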
