Culture In a Frame: CB as a Comic-Based Benchmark for Multimodal Culturally Awareness
Cultural awareness has emerged as a critical capability for Multimodal Large Language Models (MLLMs). However, current benchmarks lack progressive difficulty in their task design and are deficient in cross-lingual tasks. Moreover, they often rely on real-world images, each of which typically depicts a single culture, which makes these benchmarks relatively easy for MLLMs. To address these limitations, we propose CB (Comics Cross-Cultural Benchmark), a novel multicultural, multitask, and multilingual benchmark for cultural awareness. CB comprises over 2000 images and over 18000 QA pairs, organized into three tasks of progressive difficulty: basic visual recognition, higher-level cultural conflict understanding, and cultural content generation. We evaluate 11 open-source MLLMs and find a significant gap between their performance and human performance. This gap demonstrates that CB poses substantial challenges for current MLLMs and encourages future research to advance their cultural awareness.