2302.00402
Cited By
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
International Conference on Machine Learning (ICML), 2023
1 February 2023
Haiyang Xu
Qinghao Ye
Mingshi Yan
Yaya Shi
Jiabo Ye
Yuanhong Xu
Chenliang Li
Bin Bi
Qiuchen Qian
Wei Wang
Guohai Xu
Ji Zhang
Songfang Huang
Feiran Huang
Jingren Zhou
MLLM
VLM
MoE
Github (2045★)
Papers citing
"mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video"
50 / 123 papers shown
MemVerse: Multimodal Memory for Lifelong Learning Agents
J. Liu
Yifei Sun
Weihua Cheng
Haodong Lei
Yirong Chen
...
Nianchen Deng
Yi Yu
Shuyue Hu
Botian Shi
Ding Wang
KELM
275
9
0
03 Dec 2025
Axial Neural Networks for Dimension-Free Foundation Models
Hyunsu Kim
Jonggeon Park
Joan Bruna
Hongseok Yang
Juho Lee
AI4CE
209
0
0
15 Oct 2025
Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey
Jinxuan Li
Chaolei Tan
Haoxuan Chen
Jianxin Ma
Jian-Fang Hu
Wei-Shi Zheng
Jianhuang Lai
VLM
252
1
0
12 Oct 2025
A TRIANGLE Enables Multimodal Alignment Beyond Cosine Similarity
Giordano Cicchetti
Eleonora Grassucci
Danilo Comminiello
204
4
0
29 Sep 2025
Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark
Nisarg A. Shah
Amir Ziai
Chaitanya Ekanadham
Vishal M. Patel
VGen
CoGe
ELM
178
0
0
17 Sep 2025
Hybrid-Tower: Fine-grained Pseudo-query Interaction and Generation for Text-to-Video Retrieval
Bangxiang Lan
Ruobing Xie
Ruixiang Zhao
Xingwu Sun
Zhanhui Kang
Gang Yang
Xirong Li
192
2
0
05 Sep 2025
VQualA 2025 Challenge on Engagement Prediction for Short Videos: Methods and Results
Dasong Li
Sizhuo Ma
Hang Hua
W. Li
Jian Wang
...
Yunlong Tang
Luchuan Song
Jinxi He
J. Wu
Hanjia Lyu
140
12
0
03 Sep 2025
VideoMind: An Omni-Modal Video Dataset with Intent Grounding for Deep-Cognitive Video Understanding
Baoyao Yang
Wanyun Li
Dixin Chen
Junxiang Chen
Wenbin Yao
Haifeng Lin
VGen
193
0
0
24 Jul 2025
Principled Multimodal Representation Learning
Xiaohao Liu
Xiaobo Xia
See-Kiong Ng
Tat-Seng Chua
362
11
0
23 Jul 2025
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks
Peiran Wu
Yunze Liu
Zhengdong Zhu
Enmin Zhou
Junxiao Shen
272
8
0
15 Jul 2025
Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation
Wenhao Li
Xiu Su
Jingyi Wu
Feng Yang
Yang-Yang Liu
Yi-Ling Chen
Shan You
Chang Xu
VLM
284
0
0
07 Jul 2025
Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval
Shubhashis Roy Dipta
Francis Ferraro
342
1
0
11 Jun 2025
Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos
Benjamin Z. Reichman
Constantin Patsch
Jack Truxal
Atishay Jain
Larry Heck
250
0
0
11 Jun 2025
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO
Jinyoung Park
Jeehye Na
Jinyoung Kim
H. Kim
OffRL
458
38
0
09 Jun 2025
Understanding Complexity in VideoQA via Visual Program Generation
Cristobal Eyzaguirre
Igor Vasiljevic
Achal Dave
Jiajun Wu
Rares Andrei Ambrus
Thomas Kollar
Juan Carlos Niebles
P. Tokmakov
335
0
0
19 May 2025
AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations
Junli Liu
Qizhi Chen
Zechuan Wang
Yiwen Tang
Yiting Zhang
Chi Yan
Dong Wang
Xiaochen Li
Jiangwei Zhong
CoGe
627
11
0
10 Apr 2025
REEF: Relevance-Aware and Efficient LLM Adapter for Video Understanding
Sakib Reza
Xiyun Song
Heather Yu
Zongfang Lin
Mohsen Moghaddam
Mario Sznaier
322
0
0
07 Apr 2025
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
Yunlong Tang
Jing Bi
Chao Huang
Susan Liang
Daiki Shimada
...
Jinxi He
Liu He
Zeliang Zhang
Jiebo Luo
Chenliang Xu
370
10
0
07 Apr 2025
Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention
International Journal of Computer Vision (IJCV), 2024
Jiuniu Wang
Wenjia Xu
Qingzhong Wang
Antoni B. Chan
500
3
0
03 Apr 2025
MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Jinfa Huang
Jie Lou
Debing Zhang
Rongrong Ji
670
8
0
26 Mar 2025
Can Text-to-Video Generation help Video-Language Alignment?
Computer Vision and Pattern Recognition (CVPR), 2025
Luca Zanella
Goran Frehse
Willi Menapace
Sergey Tulyakov
Yiming Wang
Elisa Ricci
DiffM
VGen
372
1
0
24 Mar 2025
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2025
Shehreen Azad
Vibhav Vineet
Yogesh S Rawat
VLM
1.1K
15
0
11 Mar 2025
Towards Fine-Grained Video Question Answering
Wei Dai
Alan Luo
Zane Durante
Debadutta Dash
Arnold Milstein
Kevin Schulman
Ehsan Adeli
L. Fei-Fei
329
1
0
10 Mar 2025
IteRPrimE: Zero-shot Referring Image Segmentation with Iterative Grad-CAM Refinement and Primary Word Emphasis
AAAI Conference on Artificial Intelligence (AAAI), 2025
Yun Wang
Jingchen Ni
Yong-Jin Liu
Chun Yuan
Yansong Tang
367
20
0
02 Mar 2025
Pretrained Image-Text Models are Secretly Video Captioners
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Chunhui Zhang
Yiren Jian
Z. Ouyang
Soroush Vosoughi
VLM
598
15
0
20 Feb 2025
Natural Language Generation from Visual Events: State-of-the-Art and Key Open Questions
Aditya K Surikuchi
Raquel Fernández
Sandro Pezzelle
EGVM
1.2K
0
0
18 Feb 2025
HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads
The Web Conference (WWW), 2025
Guobing Gan
Kaiming Gao
Li Wang
Shen Jiang
Peng Jiang
321
3
0
09 Feb 2025
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Computer Vision and Pattern Recognition (CVPR), 2025
Rui Qian
Shuangrui Ding
Xiaoyi Dong
Pan Zhang
Yuhang Zang
Yuhang Cao
Dahua Lin
Jiaqi Wang
373
65
0
06 Jan 2025
Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions
Yi Yuan
Dongya Jia
Xiaobin Zhuang
Yuanzhe Chen
Zhengxi Liu
...
Longji Xu
Xubo Liu
Xiyuan Kang
Mark D. Plumbley
Wenwu Wang
VLM
466
4
0
03 Jan 2025
Do Language Models Understand Time?
The Web Conference (WWW), 2024
Xi Ding
Lei Wang
1.0K
13
0
18 Dec 2024
Gramian Multimodal Representation Learning and Alignment
International Conference on Learning Representations (ICLR), 2024
Giordano Cicchetti
Eleonora Grassucci
Luigi Sigillo
Danilo Comminiello
552
41
0
16 Dec 2024
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
Computer Vision and Pattern Recognition (CVPR), 2024
Hang Hua
Qing Liu
Lingzhi Zhang
Jing Shi
Zhifei Zhang
Yilin Wang
Jianming Zhang
Jiebo Luo
CoGe
VLM
403
26
0
23 Nov 2024
Spider: Any-to-Many Multimodal LLM
Jinxiang Lai
Jie Zhang
Jun Liu
Jian Li
Xiaocheng Lu
Song Guo
MLLM
669
6
0
14 Nov 2024
PIP-MM: Pre-Integrating Prompt Information into Visual Encoding via Existing MLLM Structures
Tianxiang Wu
Minxin Nie
Ziqiang Cao
MLLM
172
0
0
30 Oct 2024
Sensor2Text: Enabling Natural Language Interactions for Daily Activity Tracking Using Wearable Sensors
Proceedings of the ACM on Interactive Mobile Wearable and Ubiquitous Technologies (IMWUT), 2024
Wenqiang Chen
Jiaxuan Cheng
Leyao Wang
Wei Zhao
Wojciech Matusik
340
20
0
26 Oct 2024
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
International Conference on Learning Representations (ICLR), 2024
Wenhao Chai
Enxin Song
Y. Du
Chenlin Meng
Vashisht Madhavan
Omer Bar-Tal
Jeng-Neng Hwang
Saining Xie
Christopher D. Manning
3DV
826
118
0
04 Oct 2024
Delving Deep into Engagement Prediction of Short Videos
European Conference on Computer Vision (ECCV), 2024
Dasong Li
Wenjie Li
Baili Lu
Hongsheng Li
Sizhuo Ma
Gurunandan Krishnan
Jian Wang
459
9
0
30 Sep 2024
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion
Neural Information Processing Systems (NeurIPS), 2024
Ming Dai
Lingfeng Yang
Yihao Xu
Zhenhua Feng
Wankou Yang
ObjD
492
48
0
26 Sep 2024
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Shengsheng Qian
Zuyi Zhou
Dizhan Xue
Bing Wang
Changsheng Xu
LRM
474
6
0
19 Sep 2024
Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
AAAI Conference on Artificial Intelligence (AAAI), 2024
Yiyi Zhou
Qiong Wu
Wenhao Lin
Weihao Ye
VLM
354
87
0
16 Sep 2024
Enhancing Long Video Understanding via Hierarchical Event-Based Memory
Dingxin Cheng
Mingda Li
Jingyu Liu
Yongxin Guo
Bin Jiang
Qingbin Liu
Xi Chen
Bo Zhao
317
15
0
10 Sep 2024
IVGF: The Fusion-Guided Infrared and Visible General Framework
Fangcen Liu
Chenqiang Gao
Fang Chen
Pengcheng Li
Junjie Guo
Deyu Meng
441
1
0
02 Sep 2024
I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing
Neural Information Processing Systems (NeurIPS), 2024
Yiwei Ma
Jiayi Ji
Ke Ye
Weihuang Lin
Zhibin Wang
Yonghan Zheng
Qiang-feng Zhou
Xiaoshuai Sun
Rongrong Ji
345
41
0
26 Aug 2024
T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval
ACM Multimedia (MM), 2024
Yili Li
Jing Yu
Keke Gai
Bang Liu
Gang Xiong
Qi Wu
DiffM
VGen
271
6
0
21 Aug 2024
EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval
Thomas Hummel
Shyamgopal Karthik
Mariana-Iuliana Georgescu
Zeynep Akata
EgoV
474
24
0
23 Jul 2024
WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding
Quan Kong
Yuki Kawana
Rajat Saini
Ashutosh Kumar
Jingjing Pan
...
Yohei Ozao
Balázs Opra
D. Anastasiu
Yoichi Sato
Norimasa Kobori
VGen
212
25
0
22 Jul 2024
Tarsier: Recipes for Training and Evaluating Large Video Description Models
Jiawei Wang
Liping Yuan
Yuchen Zhang
345
135
0
30 Jun 2024
Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts
Aditya Sharma
Michael Saxon
William Yang Wang
VLM
306
13
0
24 Jun 2024
UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos
Yuting Mei
Linli Yao
Qin Jin
255
3
0
24 Jun 2024
Long Story Short: Story-level Video Understanding from 20K Short Films
Ridouane Ghermi
Xi Wang
Vicky Kalogeiton
Ivan Laptev
VGen
253
2
0
14 Jun 2024