WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

16 October 2024
Genta Indra Winata
Frederikus Hudi
Patrick Amadeus Irawan
David Anugraha
Rifki Afina Putri
Yutong Wang
Adam Nohejl
Ubaidillah Ariq Prathama
Nedjma Ousidhoum
Afifa Amriani
Anar Rzayev
Anirban Das
Ashmari Pramodya
Aulia Adila
Bryan Wilie
Candy Olivia Mawalim
Ching Lam Cheng
Daud Abolade
Emmanuele Chersoni
Enrico Santus
Fariz Ikhwantri
Garry Kuwanto
Hanyang Zhao
Haryo Akbarianto Wibowo
Holy Lovenia
Jan Christian Blaise Cruz
Jan Wira Gotama Putra
Junho Myung
Lucky Susanto
Maria Angelica Riera Machin
Marina Zhukova
Michael Anugraha
Muhammad Farid Adilazuarda
Natasha Santosa
Peerat Limkonchotiwat
Raj Dabre
Rio Alexander Audino
Samuel Cahyawijaya
Shi-Xiong Zhang
Stephanie Yulia Salim
Yi Zhou
Yinxuan Gui
David Ifeoluwa Adelani
En-Shiun Annie Lee
Shogo Okada
Ayu Purwarianti
Alham Fikri Aji
Taro Watanabe
Derry Wijaya
Alice H. Oh
Chong-Wah Ngo
Abstract

Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.
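As a rough illustration of how the benchmark's evaluation task is structured, the sketch below scores a vision-language model on one of the WorldCuisines evaluation splits. The Hugging Face dataset identifier, split name, and field names ("image", "question", "answer") are assumptions for illustration only, not the authors' published interface, and the model call is a placeholder.

from datasets import load_dataset

# Hypothetical repository id and split; consult the authors' release for the
# actual dataset location and configuration names.
dataset = load_dataset("worldcuisines/vqa", split="test")

def answer_question(image, question: str) -> str:
    # Placeholder: replace with a call to the VLM under evaluation
    # (e.g. an open-weight model or a hosted API).
    return "unknown dish"

correct = 0
for example in dataset:
    # Each instance is assumed to pair a food image with a question in one of
    # the 30 languages/dialects and a gold answer (a dish name or its origin).
    prediction = answer_question(example["image"], example["question"])
    correct += int(prediction.strip().lower() == example["answer"].strip().lower())

print(f"Exact-match accuracy: {correct / len(dataset):.3f}")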

@article{winata2025_2410.12705,
  title={WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines},
  author={Genta Indra Winata and Frederikus Hudi and Patrick Amadeus Irawan and David Anugraha and Rifki Afina Putri and Yutong Wang and Adam Nohejl and Ubaidillah Ariq Prathama and Nedjma Ousidhoum and Afifa Amriani and Anar Rzayev and Anirban Das and Ashmari Pramodya and Aulia Adila and Bryan Wilie and Candy Olivia Mawalim and Ching Lam Cheng and Daud Abolade and Emmanuele Chersoni and Enrico Santus and Fariz Ikhwantri and Garry Kuwanto and Hanyang Zhao and Haryo Akbarianto Wibowo and Holy Lovenia and Jan Christian Blaise Cruz and Jan Wira Gotama Putra and Junho Myung and Lucky Susanto and Maria Angelica Riera Machin and Marina Zhukova and Michael Anugraha and Muhammad Farid Adilazuarda and Natasha Santosa and Peerat Limkonchotiwat and Raj Dabre and Rio Alexander Audino and Samuel Cahyawijaya and Shi-Xiong Zhang and Stephanie Yulia Salim and Yi Zhou and Yinxuan Gui and David Ifeoluwa Adelani and En-Shiun Annie Lee and Shogo Okada and Ayu Purwarianti and Alham Fikri Aji and Taro Watanabe and Derry Tanti Wijaya and Alice Oh and Chong-Wah Ngo},
  journal={arXiv preprint arXiv:2410.12705},
  year={2025}
}