Title | Authors |
---|---|
Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks | Mengzhao Jia, Wenhao Yu, Kaixin Ma, Tianqing Fang, Zhihan Zhang, Siru Ouyang, Hongming Zhang, Meng-Long Jiang, Dong Yu |
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning | Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, ..., Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang |
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, ..., Qingsong Wen, Zhang Zhang, L. Wang, Rong Jin, Tieniu Tan |
EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model | Feipeng Ma, Yizhou Zhou, Hebei Li, Zilong He, Siying Wu, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun |
Visual Agents as Fast and Slow Thinkers | Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Ying Nian Wu, Dongfang Liu |