

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

23 August 2024
Yi-Fan Zhang
Huanyu Zhang
Haochen Tian
Chaoyou Fu
Shuangqing Zhang
Junfei Wu
Feng Li
Kun Wang
Qingsong Wen
Zhang Zhang
Liang Wang
Rong Jin
Tieniu Tan
Abstract

Comprehensive evaluation of Multimodal Large Language Models (MLLMs) has recently garnered widespread attention in the research community. However, we observe that existing benchmarks share several barriers that make it difficult to measure the significant challenges models face in the real world, including: 1) a small data scale, which leads to large performance variance; 2) reliance on model-based annotations, which limits data quality; 3) insufficient task difficulty, especially due to limited image resolution. To tackle these issues, we introduce MME-RealWorld. Specifically, we collect more than 300K images from public datasets and the Internet, filtering 13,366 high-quality images for annotation. This effort involves 25 professional annotators and 7 experts in MLLMs, yielding 29,429 question-answer pairs that cover 43 subtasks across 5 real-world scenarios and are extremely challenging even for humans. To our knowledge, MME-RealWorld is the largest manually annotated benchmark to date, featuring the highest resolution and a targeted focus on real-world applications. We further conduct a thorough evaluation of 28 prominent MLLMs, such as GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet. Our results show that even the most advanced models struggle with our benchmark: none reaches 60% accuracy. The challenges of perceiving high-resolution images and understanding complex real-world scenarios remain urgent open problems. The data and evaluation code are released at this https URL.
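The headline metric is plain multiple-choice accuracy over the 29,429 question-answer pairs. Below is a minimal sketch of that evaluation protocol in Python; the annotation file name and its field names (`id`, `answer`) are illustrative assumptions, not the benchmark's released schema.

```python
import json

def evaluate(predictions: dict[str, str], annotation_file: str) -> float:
    """Compute multiple-choice accuracy for predictions keyed by question id.

    Assumes (hypothetically) a JSON list of records shaped like
    {"id": ..., "question": ..., "choices": ..., "answer": "A"}.
    """
    with open(annotation_file, encoding="utf-8") as f:
        samples = json.load(f)

    # A prediction counts as correct when the chosen option letter
    # matches the annotated answer, case-insensitively.
    correct = sum(
        1
        for s in samples
        if predictions.get(s["id"], "").strip().upper() == s["answer"].strip().upper()
    )
    return correct / len(samples)

# Example usage: per the abstract, a model would need acc > 0.60 here
# to surpass every model the authors evaluated.
# acc = evaluate(my_model_predictions, "mme_realworld_annotations.json")
```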

@article{zhang2025_2408.13257,
  title={MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?},
  author={Yi-Fan Zhang and Huanyu Zhang and Haochen Tian and Chaoyou Fu and Shuangqing Zhang and Junfei Wu and Feng Li and Kun Wang and Qingsong Wen and Zhang Zhang and Liang Wang and Rong Jin and Tieniu Tan},
  journal={arXiv preprint arXiv:2408.13257},
  year={2025}
}