Arcee Trinity Large Technical Report

Varun Singh
Lucas Krauss
Sami Jaghouar
Matej Sirovatka
Charles Goddard
Fares Obied
Jack Min Ong
Jannik Straube
Fern
Aria Harley
Conner Stewart
Colin Kealty
Maziyar Panahi
Simon Kirsten
Anushka Deshpande
Anneketh Vij
Arthur Bresnu
Pranav Veldurthi
Raghav Ravishankar
Hardik Bishnoi
DatologyAI Team
Arcee AI Team
Prime Intellect Team
Mark McQuade
Johannes Hagemann
Lucas Atkins
Comments: 17 pages main text, 10 pages appendix, 2 pages bibliography, 4 figures, 4 tables
Abstract

We present the technical report for Arcee Trinity Large, a sparse Mixture-of-Experts model with 400B total parameters and 13B activated per token. We also report on Trinity Nano and Trinity Mini: Trinity Nano has 6B total parameters with 1B activated per token, and Trinity Mini has 26B total parameters with 3B activated per token. The models' modern architecture includes interleaved local and global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing for the Mixture-of-Experts layers. For Trinity Large, we also introduce a new MoE load-balancing strategy, Soft-clamped Momentum Expert Bias Updates (SMEBU). We train the models using the Muon optimizer. All three models completed training with zero loss spikes. Trinity Nano and Trinity Mini were pre-trained on 10 trillion tokens, and Trinity Large was pre-trained on 17 trillion tokens. The model checkpoints are available at this https URL.
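The abstract names sigmoid routing and a momentum-based, soft-clamped expert bias update (SMEBU) without giving details. As a purely illustrative sketch, and not the paper's implementation, a sigmoid router with a momentum-smoothed, bounded per-expert routing bias might look like the following; the class name, hyperparameters, and the specific clamping and momentum rules are assumptions for demonstration only.

```python
import torch
import torch.nn as nn


class SigmoidRouter(nn.Module):
    """Illustrative sigmoid MoE router with a momentum-smoothed, soft-clamped
    per-expert bias. This is a generic sketch, NOT the SMEBU algorithm from
    the Trinity Large report; the update rule below is an assumption."""

    def __init__(self, hidden_dim: int, num_experts: int, top_k: int = 2,
                 momentum: float = 0.9, bias_clamp: float = 1.0):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.top_k = top_k
        self.momentum = momentum
        self.bias_clamp = bias_clamp
        # Non-learned bias used only for expert selection (load balancing).
        self.register_buffer("expert_bias", torch.zeros(num_experts))
        self.register_buffer("bias_velocity", torch.zeros(num_experts))

    def forward(self, x: torch.Tensor):
        # x: (tokens, hidden_dim)
        scores = torch.sigmoid(self.gate(x))          # (tokens, num_experts)
        routing_scores = scores + self.expert_bias    # bias only affects selection
        _, topk_idx = routing_scores.topk(self.top_k, dim=-1)
        # Gate weights come from the unbiased sigmoid scores of the chosen experts.
        gate_weights = torch.gather(scores, -1, topk_idx)
        return topk_idx, gate_weights

    @torch.no_grad()
    def update_bias(self, expert_load: torch.Tensor, lr: float = 1e-3):
        """Nudge the bias toward under-loaded experts with a momentum update,
        then softly bound it with tanh (assumed rule, for illustration)."""
        error = expert_load.mean() - expert_load      # positive if under-loaded
        self.bias_velocity.mul_(self.momentum).add_(lr * error)
        self.expert_bias.add_(self.bias_velocity)
        self.expert_bias.copy_(
            self.bias_clamp * torch.tanh(self.expert_bias / self.bias_clamp)
        )
```

In this sketch the bias shifts routing decisions without altering the gate weights applied to expert outputs, and the tanh clamp keeps the bias bounded so no expert can be permanently starved or monopolized; how the actual SMEBU update differs is described in the report itself.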
