Command R7B Arabic: A Small, Enterprise Focused, Multilingual, and Culturally Aware Arabic LLM

Building high-quality large language models (LLMs) for enterprise Arabic applications remains challenging, largely due to the limited availability of digitized Arabic data. In this work, we present a data synthesis and refinement strategy to help address this problem, leveraging synthetic data generation and human-in-the-loop annotation to expand our Arabic training corpus. We further present our iterative post-training recipe, which was essential to achieving state-of-the-art performance in aligning the model with human preferences, a critical aspect of enterprise use cases. The culmination of this effort is the release of a small (7B-parameter), open-weight model that outperforms similarly sized peers in head-to-head comparisons and on Arabic-focused benchmarks covering cultural knowledge, instruction following, retrieval-augmented generation (RAG), and contextual faithfulness.
@article{alnumay2025_2503.14603,
  title={Command R7B Arabic: A Small, Enterprise Focused, Multilingual, and Culturally Aware Arabic LLM},
  author={Yazeed Alnumay and Alexandre Barbet and Anna Bialas and William Darling and Shaan Desai and Joan Devassy and Kyle Duffy and Stephanie Howe and Olivia Lasche and Justin Lee and Anirudh Shrinivason and Jennifer Tracey},
  journal={arXiv preprint arXiv:2503.14603},
  year={2025}
}