Title |
---|
![]() MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? Yi-Fan Zhang Huanyu Zhang Haochen Tian Chaoyou Fu Shuangqing Zhang ...Qingsong Wen Zhang Zhang L. Wang Rong Jin Tieniu Tan |
![]() Visual Agents as Fast and Slow Thinkers Guangyan Sun Mingyu Jin Zhenting Wang Cheng-Long Wang Siqi Ma Qifan Wang Ying Nian Wu Ying Nian Wu Dongfang Liu Dongfang Liu |
![]() MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and
Instruction-Tuning Dataset for LVLMs Ziyu Liu Tao Chu Yuhang Zang Xilin Wei Xiaoyi Dong ...Zijian Liang Yuanjun Xiong Yu Qiao Dahua Lin Jiaqi Wang |
![]() ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation Cheng Yang Chufan Shi Yaxin Liu Bo Shui Junjie Wang ...Yuxiang Zhang Gongye Liu Xiaomei Nie Deng Cai Yujiu Yang |
![]() What If We Recaption Billions of Web Images with LLaMA-3? Xianhang Li Haoqin Tu Mude Hui Zeyu Wang Bingchen Zhao ...Jieru Mei Qing Liu Huangjie Zheng Yuyin Zhou Cihang Xie |