
Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation

9 April 2025
Israfel Salazar
Manuel Fernández Burda
Shayekh Bin Islam
Arshia Soltani Moakhar
Shivalika Singh
Fabian Farestam
Angelika Romanou
Danylo Boiko
Dipika Khullar
Mike Zhang
Dominik Krzemiński
Jekaterina Novikova
Luísa Shimabucoro
Joseph Marvin Imperial
Rishabh Maheshwary
Sharad Duwal
Alfonso Amayuelas
Swati Rajwal
Jebish Purbey
Ahmed Ruby
Nicholas Popovič
Marek Šuppa
Azmine Toushik Wasi
Ram Mohan Rao Kadiyala
Olga Tsymboi
Maksim Kostritsya
Bardia Soltani Moakhar
Gabriel da Costa Merlin
Otávio Ferracioli Coletti
Maral Jabbari Shiviari
MohammadAmin Farahani Fard
Silvia Fernandez
María Grandury
Dmitry Abulkhanov
Drishti Sharma
Andre Guarnier De Mitri
Leticia Bossatto Marchezi
Setayesh Heydari
Johan Obando-Ceron
Nazar Kohut
Beyza Ermis
Desmond Elliott
Enzo Ferrante
Sara Hooker
arXiv (abs) · PDF · HTML · HuggingFace
Main: 23 pages · Bibliography: 9 pages · Appendix: 22 pages · 7 figures · 20 tables
Abstract

The evaluation of vision-language models (VLMs) has mainly relied on English-language benchmarks, leaving significant gaps in both multilingual and multicultural coverage. While multilingual benchmarks have expanded in both size and language coverage, many rely on translations of English datasets and fail to capture cultural nuances. In this work, we propose Kaleidoscope, the most comprehensive exam benchmark to date for the multilingual evaluation of vision-language models. Kaleidoscope is a large-scale, in-language multimodal benchmark designed to evaluate VLMs across diverse languages and visual inputs. Kaleidoscope covers 18 languages and 14 different subjects, amounting to a total of 20,911 multiple-choice questions. Built through an open science collaboration with a diverse group of researchers worldwide, Kaleidoscope ensures linguistic and cultural authenticity. We evaluate top-performing multilingual vision-language models and find that they perform poorly on low-resource languages and in complex multimodal scenarios. Our results highlight the need for progress on culturally inclusive multimodal evaluation frameworks.
