Behind Maya: Building a Multilingual Vision Language Model
Nahid Alam
Karthik Reddy Kanjula
Surya Guthikonda
Timothy Chung
Bala Krishna S Vegesna
Abhipsha Das
Anthony Susevski
Ryan Sze-Yin Chan
S M Iftekhar Uddin
Shayekh Bin Islam
Roshan Santhosh
Snegha A
Drishti Sharma
Chen Liu
Isha Chaturvedi
Genta Indra Winata
Ashvanth.S
Snehanshu Mukherjee
Alham Fikri Aji

Abstract
In recent years, large Vision-Language Models (VLMs) have developed rapidly. They have shown impressive results on academic benchmarks, primarily in widely spoken languages, but underperform on low-resource languages and in varied cultural contexts. To address these limitations, we introduce Maya, an open-source multilingual VLM. Our contributions are: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; and 2) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code is available at this https URL.
@article{alam2025_2505.08910,
  title={Behind Maya: Building a Multilingual Vision Language Model},
  author={Nahid Alam and Karthik Reddy Kanjula and Surya Guthikonda and Timothy Chung and Bala Krishna S Vegesna and Abhipsha Das and Anthony Susevski and Ryan Sze-Yin Chan and S M Iftekhar Uddin and Shayekh Bin Islam and Roshan Santhosh and Snegha A and Drishti Sharma and Chen Liu and Isha Chaturvedi and Genta Indra Winata and Ashvanth.S and Snehanshu Mukherjee and Alham Fikri Aji},
  journal={arXiv preprint arXiv:2505.08910},
  year={2025}
}