Trillion 7B Technical Report

Abstract
We introduce Trillion-7B, the most token-efficient Korean-centric multilingual LLM available. Our novel Cross-lingual Document Attention (XLDA) mechanism enables highly efficient and effective knowledge transfer from English to target languages like Korean and Japanese. Combined with optimized data mixtures, language-specific filtering, and tailored tokenizer construction, Trillion-7B achieves competitive performance while dedicating only 10% of its 2T training tokens to multilingual data and requiring just 59.4K H100 GPU hours.
@article{han2025_2504.15431,
  title={Trillion 7B Technical Report},
  author={Sungjun Han and Juyoung Suk and Suyeong An and Hyungguk Kim and Kyuseok Kim and Wonsuk Yang and Seungtaek Choi and Jamin Shin},
  journal={arXiv preprint arXiv:2504.15431},
  year={2025}
}