Arcee Trinity Large Technical Report

Varun Singh
Lucas Krauss
Sami Jaghouar
Matej Sirovatka
Charles Goddard
Fares Obied
Jack Min Ong
Jannik Straube
Fern
Aria Harley
Conner Stewart
Colin Kealty
Maziyar Panahi
Simon Kirsten
Anushka Deshpande
Anneketh Vij
Arthur Bresnu
Pranav Veldurthi
Raghav Ravishankar
Hardik Bishnoi
DatologyAI Team
Arcee AI Team
Prime Intellect Team
Mark McQuade
Johannes Hagemann
Lucas Atkins
Comments: 17 pages main text, 10 pages appendix, 2 pages bibliography, 4 figures, 4 tables
Abstract

We present the technical report for Arcee Trinity Large, a sparse Mixture-of-Experts model with 400B total parameters and 13B activated per token. We also report on Trinity Nano and Trinity Mini: Trinity Nano has 6B total parameters with 1B activated per token, and Trinity Mini has 26B total parameters with 3B activated per token. The models' modern architecture includes interleaved local and global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing for the Mixture-of-Experts layers. For Trinity Large, we also introduce a new MoE load-balancing strategy, Soft-clamped Momentum Expert Bias Updates (SMEBU). We train the models using the Muon optimizer. All three models completed training with zero loss spikes. Trinity Nano and Trinity Mini were pre-trained on 10 trillion tokens, and Trinity Large was pre-trained on 17 trillion tokens. The model checkpoints are available at this https URL.
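The abstract names sigmoid routing and a momentum-based, soft-clamped expert bias update (SMEBU) without giving details. As a purely illustrative sketch, and not the paper's implementation, a sigmoid router with a momentum-smoothed, bounded per-expert routing bias might look like the following; the class name, hyperparameters, and the specific clamping and momentum rules are assumptions for demonstration only.

```python
import torch
import torch.nn as nn


class SigmoidRouter(nn.Module):
    """Illustrative sigmoid MoE router with a momentum-smoothed, soft-clamped
    per-expert bias. This is a generic sketch, NOT the SMEBU algorithm from
    the Trinity Large report; the update rule below is an assumption."""

    def __init__(self, hidden_dim: int, num_experts: int, top_k: int = 2,
                 momentum: float = 0.9, bias_clamp: float = 1.0):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.top_k = top_k
        self.momentum = momentum
        self.bias_clamp = bias_clamp
        # Non-learned bias used only for expert selection (load balancing).
        self.register_buffer("expert_bias", torch.zeros(num_experts))
        self.register_buffer("bias_velocity", torch.zeros(num_experts))

    def forward(self, x: torch.Tensor):
        # x: (tokens, hidden_dim)
        scores = torch.sigmoid(self.gate(x))          # (tokens, num_experts)
        routing_scores = scores + self.expert_bias    # bias only affects selection
        _, topk_idx = routing_scores.topk(self.top_k, dim=-1)
        # Gate weights come from the unbiased sigmoid scores of the chosen experts.
        gate_weights = torch.gather(scores, -1, topk_idx)
        return topk_idx, gate_weights

    @torch.no_grad()
    def update_bias(self, expert_load: torch.Tensor, lr: float = 1e-3):
        """Nudge the bias toward under-loaded experts with a momentum update,
        then softly bound it with tanh (assumed rule, for illustration)."""
        error = expert_load.mean() - expert_load      # positive if under-loaded
        self.bias_velocity.mul_(self.momentum).add_(lr * error)
        self.expert_bias.add_(self.bias_velocity)
        self.expert_bias.copy_(
            self.bias_clamp * torch.tanh(self.expert_bias / self.bias_clamp)
        )
```

In this sketch the bias shifts routing decisions without altering the gate weights applied to expert outputs, and the tanh clamp keeps the bias bounded so no expert can be permanently starved or monopolized; how the actual SMEBU update differs is described in the report itself.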
