SEA-LION: Southeast Asian Languages in One Network

Abstract

Recently, Large Language Models (LLMs) have dominated much of the artificial intelligence scene with their ability to process and generate natural languages. However, the majority of LLM research and development remains English-centric, leaving low-resource languages such as those in the Southeast Asian (SEA) region under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance across LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.

View on arXiv

@article{ng2025_2504.05747,
  title={ SEA-LION: Southeast Asian Languages in One Network },
  author={ Raymond Ng and Thanh Ngan Nguyen and Yuli Huang and Ngee Chia Tai and Wai Yi Leong and Wei Qi Leong and Xianbin Yong and Jian Gang Ngui and Yosephine Susanto and Nicholas Cheng and Hamsawardhini Rengarajan and Peerat Limkonchotiwat and Adithya Venkatadri Hulagadri and Kok Wai Teng and Yeo Yeow Tong and Bryan Siow and Wei Yi Teo and Wayne Lau and Choon Meng Tan and Brandon Ong and Zhi Hao Ong and Jann Railey Montalan and Adwin Chan and Sajeban Antonyrex and Ren Lee and Esther Choa and David Ong Tat-Wee and Bing Jie Darius Liu and William Chandra Tjhi and Erik Cambria and Leslie Teo },
  journal={arXiv preprint arXiv:2504.05747},
  year={ 2025 }
}

Comments on this paper