
Apple Intelligence Foundation Language Models: Tech Report 2025

Alex Guillen Garcia
Guoli Yin
Lezhi Li
Mohana Prasad Sathya Moorthy
Hongbin Gao
Jay Tang
Joanna Arreaza-Taylor
Faye Lao
Carina Peng
Josh Shaffer
Dan Masi
Sushma Rao
Tommi Vehvilainen
Senyu Tong
Dongcai Shen
Yang Zhao
Chris Bartels
Peter Fu
Qingqing Cao
Christopher Neubauer
Ethan Li
Mingfei Gao
Rebecca Callahan
Richard Wei
Patrick Dong
Alex Braunstein
Sachin Ravi
Adolfo Lopez Mendez
Kaiwei Huang
Kun Duan
Haoshuo Huang
Rui Qian
Stefano Ligas
Jordan Huffaker
Dongxu Li
Bailin Wang
Nanzhu Wang
Anuva Agarwal
Tait Madsen
Josh Newnham
Abhishek Sharma
Zhile Ren
Deepak Gopinath
Erik Daxberger
Saptarshi Guha
Oron Levy
Jing Lu
Nan Dun
Marc Kirchner
Yinfei Yang
Manjot Bilkhu
Dave Nelson
Anthony Spalvieri-Kruse
Juan Lao Tebar
Yang Xu
Phani Mutyala
Gabriel Jacoby-Cooper
Yingbo Wang
Karla Vega
Vishaal Mahtani
Darren Botten
Eric Wang
Hanli Li
Matthias Paulik
Haoran Yan
Navid Shiee
Yihao Qian
Bugu Wu
Qi Zhu
Ob Adaranijo
Bhuwan Dhingra
Zhe Gan
Nicholas Seidl
Grace Duanmu
Rong Situ
Yiping Ma
Yin Xia
David Riazati
Vasileios Saveris
Anh Nguyen
Michael
Patrick Sonnenberg
Chinguun Erdenebileg
Yanghao Li
Vivian Ma
James Chou
Isha Garg
Mark Lee
Keen You
Yuhong Li
Ransen Niu
Nandhitha Raghuram
Pulkit Agrawal
Henry Mason
Sumeet Singh
Keyu He
Hong-You Chen
Lucas Guibert
Shiyu Li
Varsha Paidi
Narendran Raghavan
Mingze Xu
Yuli Yang
Sergiu Sima
Irina Belousova
Sprite Chu
Afshin Dehghan
Philipp Dufter
David Haldimann
Zhen Yang
Margit Bowler
Chang Liu
Ying-Chang Cheng
Vivek Rathod
Syd Evans
Wilson Tsao
Dustin Withers
Haitian Sun
Biyao Wang
Peter Grasch
Walker Cheng
Yihao Feng
Vivek Kumar
Frank Chu
Victoria Mönch
Juan Haladjian
Doug Kang
Jiarui Lu
Ciro Sannino
Max Lam
Floris Weers
Bowen Pan
Kenneth Jung
Dhaval Doshi
Fangping Shi
Olli Saarikivi
Alp Aygar
Josh Elman
Cheng Leong
Eshan Verma
Matthew Lei
Jeff Nichols
Jiulong Shan
Donald Zhang
Lawrence Zhou
Stephen Murphy
Xianzhi Du
Chang Lan
Ankur Jain
Elmira Amirloo
Marcin Eichner
Naomy Sabo
Anupama Mann Anupama
David Qiu
Zhao Meng
Michael FitzMaurice
Peng Zhang
Simon Yeung
Chen Chen
Marco Zuliani
Andrew Hansen
Yang Lu
Brent Ramerth
Ziyi Zhong
Parsa Mazaheri
Matthew Hopkins
Mengyu Li
Simon Wang
David Chen
Farzin Rasteh
Chong Wang
Josh Gardner
Asaf Liberman
Haoxuan You
Andrew Walkingshaw
Xingyu Zhou
Jinhao Lei
Yan Meng
Quentin Keunebroek
Sam Wiseman
Anders Boesen Lindbo Larsen
Yi Zhang
Zaid Ahmed
Haiming Gang
Aaron Franklin
Kelvin Zou
Guillaume Seguin
Jonathan Janke
Rachel Burger
Co Giang
Cheng Shen
Jen Liu
Sanskruti Shah
Xiang Kong
Yiran Fei
TJ Collins
Chen Zhang
Zhiyun Lu
Michael Booker
Qin Ba
Yasutaka Tanaka
Andres Romero Mier Y Teran
Federico Scozzafava
Regan Poston
Jane Li
Eduardo Jimenez
Bas Straathof
Karanjeet Singh
Lindsay Hislop
Rajat Arora
Deepa Seshadri
Boyue Li
Colorado Reed
Zhen Li
TJ Lu
Yi Wang
Kaelen Haag
Nicholas Lusskin
Raunak Sinha
Rahul Nair
Eldon Schoop
Mary Beth Kery
Mehrdad Farajtabar
Brenda Yang
George Horrell
Shiwen Zhao
Dhruti Shah
Cha Chen
Bowen Zhang
Chang Gao
Devi Krishna
Jennifer Mallalieu
Javier Movellan
Di Feng
Emily Zhang
Sam Xu
Junting Pan
Dominik Moritz
Suma Jayaram
Kevin Smith
Dongseong Hwang
Daniel Parilla
Jiaming Hu
You-Cyuan Jhang
Emad Soroush
Fred Hohman
Nan Du
Emma Wang
Sam Dodge
Pragnya Sridhar
Joris Pelemans
Wei Fang
Nina Wenzel
Joseph Yitan Cheng
Hadas Kotek
Chung-Cheng Chiu
Meng Cao
Haijing Fu
Ruixuan Hou
Ke Ye
Diane Zhu
Nikhil Bhendawade
Joseph Astrauskas
Jian Liu
Sai Aitharaju
Wentao Wu
Artsiom Peshko
Hyunjik Kim
Nilesh Shahdadpuri
Andy De Wang
Qi Shan
Piotr Maj
Raul Rea Menacho
Justin Lazarow
Eric Liang Yang
Arsalan Farooq
Donghan Yu
David Güera
Minsik Cho
Kavya Nerella
Yongqiang Wang
Tao Jia
John Park
Jeff Lai
Haotian Zhang
Futang Peng
Daniele Molinari
Aparna Rajamani
Tyler Johnson
Lauren Gardiner
Chao Jia
Violet Yao
Wojciech Kryscinski
Xiujun Li
Shang-Chen Wu
et al. (294 additional authors not shown)
Main: 19 pages, 5 figures, 3 tables; Bibliography: 4 pages; Appendix: 4 pages
Abstract

We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: (i) a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and (ii) a scalable server model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple's Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines.
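To make the on-device model's 2-bit quantization-aware training concrete, the following is a minimal sketch (not Apple's implementation; the report does not publish code): fake quantization of weights to four levels with a straight-through estimator, so the model learns during training to tolerate 2-bit precision. All function names are illustrative.

    # Minimal sketch of 2-bit quantization-aware training via fake
    # quantization with a straight-through estimator (STE).
    # Illustrative only; not the method's published implementation.
    import torch

    def fake_quantize_2bit(w: torch.Tensor) -> torch.Tensor:
        """Round weights to 4 signed levels (2 bits), then dequantize,
        passing gradients straight through to the full-precision weights."""
        scale = w.abs().max() / 2  # symmetric per-tensor scale
        q = torch.clamp(torch.round(w / scale), -2, 1)  # 2-bit ints in [-2, 1]
        w_q = q * scale
        # STE: forward uses the quantized value, backward sees identity.
        return w + (w_q - w).detach()

    # Usage: apply to a layer's weight during training so the loss is
    # computed under (simulated) 2-bit precision.
    layer = torch.nn.Linear(16, 16)
    x = torch.randn(4, 16)
    y = torch.nn.functional.linear(x, fake_quantize_2bit(layer.weight), layer.bias)
    y.sum().backward()  # gradients still reach layer.weight via the STE

In deployment, only the 2-bit integers and the scale would be stored; the STE trick exists solely so training can optimize through the non-differentiable rounding step.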
