Long Context Transfer from Language to Vision. Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, Ziwei Liu |
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs. Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, Kai Chen |
Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report. Franz Louis Cesista |
Needle In A Multimodal Haystack. Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, ..., Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, Wenhai Wang |
VCR: A Task for Pixel-Level Complex Reasoning in Vision Language Models via Restoring Occluded Text. International Conference on Learning Representations (ICLR), 2024 |
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs. Neural Information Processing Systems (NeurIPS), 2024 |
Enhancing Descriptive Image Quality Assessment with A Large-scale Multi-modal Dataset. IEEE Transactions on Image Processing (TIP), 2024 |