Title |
---|
![]() End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting Yongqi Wang Xinxiao Wu Shuo Yang Jiebo Luo |
![]() EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model Feipeng Ma Yizhou Zhou Hebei Li Zilong He Siying Wu Fengyun Rao Siying Wu Fengyun Rao Yueyi Zhang Xiaoyan Sun |
![]() Visual Agents as Fast and Slow Thinkers Guangyan Sun Mingyu Jin Zhenting Wang Cheng-Long Wang Siqi Ma Qifan Wang Ying Nian Wu Ying Nian Wu Dongfang Liu Dongfang Liu |
![]() What If We Recaption Billions of Web Images with LLaMA-3? Xianhang Li Haoqin Tu Mude Hui Zeyu Wang Bingchen Zhao ...Jieru Mei Qing Liu Huangjie Zheng Yuyin Zhou Cihang Xie |
![]() LVBench: An Extreme Long Video Understanding Benchmark Weihan Wang Zehai He Wenyi Hong Yean Cheng Xiaohan Zhang ...Shiyu Huang Bin Xu Yuxiao Dong Ming Ding Jie Tang |