
A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

8 June 2025
Bhuiyan Sanjid Shafique
Ashmal Vayani
Muhammad Maaz
H. Rasheed
Dinura Dissanayake
Mohammed Irfan Kurpath
Yahya Hmaiti
Go Inoue
Jean Lahoud
Md. Safirur Rashid
Shadid Intisar Quasem
Maheen Fatima
Franco Vidal
Mykola Maslych
Ketan More
Sanoojan Baliah
Hasindri Watawana
Yuhao Li
Fabian Farestam
Leon Schaller
Roman Tymtsiv
Simon Weber
Hisham Cholakkal
Ivan Laptev
Shin'ichi Satoh
Michael Felsberg
M. Shah
Salman Khan
Fahad Shahbaz Khan
Main: 9 pages · 24 figures · 5 tables · Bibliography: 4 pages · Appendix: 13 pages
Abstract

Large multimodal models (LMMs) have recently gained attention due to their effectiveness in understanding and generating descriptions of visual content. Most existing LMMs are limited to the English language. While a few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond English for cultural and linguistic inclusivity has yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual video LMM benchmark, named ViMUL-Bench, to evaluate video LMMs across 14 languages, including both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. ViMUL-Bench is designed to rigorously test video LMMs across 15 categories, including eight culturally diverse categories ranging from lifestyles and festivals to foods and rituals, and from local landmarks to prominent cultural personalities. ViMUL-Bench comprises both open-ended (short- and long-form) and multiple-choice questions spanning various video durations (short, medium, and long), with 8k samples manually verified by native speakers. In addition, we introduce a machine-translated multilingual video training set comprising 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, which is shown to provide a better tradeoff between high- and low-resource languages for video understanding. We hope that ViMUL-Bench, our multilingual video LMM, and the large-scale multilingual video training set will help ease future research in developing culturally and linguistically inclusive multilingual video LMMs. Our proposed benchmark, video LMM, and training data will be publicly released at this https URL.
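To make the benchmark composition described above concrete, the sketch below shows one way a ViMUL-Bench-style sample (language, cultural category, question type, duration bucket) might be represented and scored. The actual released data format, field names, and model API (`model.answer`) are not specified on this page and are assumptions for illustration only.

```python
# Hypothetical sketch of a multilingual video-QA benchmark sample and a simple
# per-language scorer. Field names and the model interface are assumptions,
# not the paper's released format.
from dataclasses import dataclass
from typing import List, Optional

LANGUAGES = [
    "English", "Chinese", "Spanish", "French", "German", "Hindi", "Arabic",
    "Russian", "Bengali", "Urdu", "Sinhala", "Tamil", "Swedish", "Japanese",
]

@dataclass
class BenchmarkSample:
    video_path: str                       # path to the video clip
    language: str                         # one of the 14 benchmark languages
    category: str                         # one of 15 categories (e.g. festivals, foods)
    question: str                         # question text in the target language
    question_type: str                    # "mcq", "open_short", or "open_long"
    duration_bucket: str                  # "short", "medium", or "long"
    choices: Optional[List[str]] = None   # only populated for multiple-choice questions
    answer: str = ""                      # gold answer, verified by native speakers

def evaluate_mcq(model, samples: List[BenchmarkSample]) -> dict:
    """Per-language accuracy on multiple-choice questions (illustrative only)."""
    correct, total = {}, {}
    for s in samples:
        if s.question_type != "mcq":
            continue  # open-ended answers would need an LLM or human judge instead
        pred = model.answer(s.video_path, s.question, s.choices)  # assumed model API
        total[s.language] = total.get(s.language, 0) + 1
        if pred.strip() == s.answer.strip():
            correct[s.language] = correct.get(s.language, 0) + 1
    return {lang: correct.get(lang, 0) / n for lang, n in total.items()}
```

Grouping results by language (and, analogously, by category or duration bucket) is what allows the high- versus low-resource tradeoff mentioned in the abstract to be measured.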
