Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2112.09583
Cited By
Align and Prompt: Video-and-Language Pre-training with Entity Prompts
17 December 2021
Dongxu Li
Junnan Li
Hongdong Li
Juan Carlos Niebles
S. Hoi
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Align and Prompt: Video-and-Language Pre-training with Entity Prompts"
50 / 138 papers shown
Title
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
Shuhuai Ren
Sishuo Chen
Shicheng Li
Xu Sun
Lu Hou
ViT
29
28
0
29 Oct 2023
JM3D & JM3D-LLM: Elevating 3D Understanding with Joint Multi-modal Cues
Jiayi Ji
Haowei Wang
Changli Wu
Yiwei Ma
Xiaoshuai Sun
Rongrong Ji
35
1
0
14 Oct 2023
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks
Avinash Madasu
Anahita Bhiwandiwalla
Vasudev Lal
VLM
29
0
0
07 Oct 2023
Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts
Bipin Rajendran
Bashir M. Al-Hashimi
MLLM
VLM
26
2
0
27 Sep 2023
VidChapters-7M: Video Chapters at Scale
Antoine Yang
Arsha Nagrani
Ivan Laptev
Josef Sivic
Cordelia Schmid
VGen
13
26
0
25 Sep 2023
Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning
Chen Jiang
Hong Liu
Xuzheng Yu
Qing Wang
Yuan-Chia Cheng
...
Zhongyi Liu
Qingpei Guo
Wei Chu
Ming Yang
Yuan Qi
16
10
0
20 Sep 2023
Improving Multimodal Classification of Social Media Posts by Leveraging Image-Text Auxiliary Tasks
Danae Sánchez Villegas
Daniel Preoctiuc-Pietro
Nikolaos Aletras
31
2
0
14 Sep 2023
Incorporating Pre-trained Model Prompting in Multimodal Stock Volume Movement Prediction
Ruibo Chen
Zhiyuan Zhang
Yi Liu
Ruihan Bao
Keiko Harimoto
Xu Sun
AIFin
AI4TS
23
0
0
11 Sep 2023
CoVR: Learning Composed Video Retrieval from Web Video Captions
Lucas Ventura
Antoine Yang
Cordelia Schmid
Gül Varol
19
21
0
28 Aug 2023
Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment
Shengxiang Zhang
Muzammal Naseer
Guangyi Chen
Zhiqiang Shen
Salman Khan
Kun Zhang
F. Khan
VLM
56
4
0
24 Aug 2023
Multi-event Video-Text Retrieval
Gengyuan Zhang
Jisen Ren
Jindong Gu
Volker Tresp
19
13
0
22 Aug 2023
Simple Baselines for Interactive Video Retrieval with Questions and Answers
Kaiqu Liang
Samuel Albanie
22
2
0
21 Aug 2023
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models
Dohwan Ko
Ji Soo Lee
M. Choi
Jaewon Chu
Jihwan Park
Hyunwoo J. Kim
20
5
0
18 Aug 2023
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval
Chaorui Deng
Qi Chen
Pengda Qin
Dave Zhenyu Chen
Qi Wu
VLM
CLIP
36
29
0
15 Aug 2023
EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding
Jiazhou Zhou
Xueye Zheng
Yuanhuiyi Lyu
Lin Wang
VLM
17
12
0
06 Aug 2023
Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation
Haowei Wang
Jiji Tang
Jiayi Ji
Xiaoshuai Sun
Rongsheng Zhang
...
Minda Zhao
Lincheng Li
zeng zhao
Tangjie Lv
R. Ji
3DV
21
13
0
06 Aug 2023
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor
Corentin Dancette
Alexandre Ramé
Matthieu Cord
MoMe
MLLM
30
42
0
30 Jul 2023
Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering
Yi Cheng
Hehe Fan
Dongyun Lin
Ying Sun
Mohan S. Kankanhalli
J. Lim
32
4
0
25 Jul 2023
Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment
Sarah Ibrahimi
Xiaohang Sun
Pichao Wang
Amanmeet Garg
Ashutosh Sanan
Mohamed Omar
44
14
0
24 Jul 2023
Fine-grained Text-Video Retrieval with Frozen Image Encoders
Zuozhuo Dai
Fang Shao
Qingkun Su
Zilong Dong
Siyu Zhu
162
1
0
14 Jul 2023
Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning
Gengyuan Zhang
Yurui Zhang
Kerui Zhang
Volker Tresp
LRM
22
10
0
12 Jul 2023
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
Sihan Chen
Xingjian He
Handong Li
Xiaojie Jin
Jiashi Feng
J. Liu
VLM
CLIP
22
8
0
15 Jun 2023
Global and Local Semantic Completion Learning for Vision-Language Pre-training
Rong-Cheng Tu
Yatai Ji
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
10
2
0
12 Jun 2023
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks
Haiyang Xu
Qinghao Ye
Xuan-Wei Wu
Mingshi Yan
Yuan Miao
...
Qingfang Qian
Maofei Que
Ji Zhang
Xiaoyan Zeng
Feiyan Huang
VLM
MLLM
38
21
0
07 Jun 2023
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Sihan Chen
Handong Li
Qunbo Wang
Zijia Zhao
Ming-Ting Sun
Xinxin Zhu
J. Liu
30
96
0
29 May 2023
TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale
Ziyun Zeng
Yixiao Ge
Zhan Tong
Xihui Liu
Shutao Xia
Ying Shan
24
9
0
23 May 2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
Xingjian He
Sihan Chen
Fan Ma
Zhicheng Huang
Xiaojie Jin
Zikang Liu
Dongmei Fu
Yi Yang
J. Liu
Jiashi Feng
VLM
CLIP
18
17
0
22 May 2023
Segment Any Anomaly without Training via Hybrid Prompt Regularization
Yunkang Cao
Xiaohao Xu
Chen Sun
Y. Cheng
Zongwei Du
Liang Gao
Weiming Shen
VLM
26
69
0
18 May 2023
Paxion: Patching Action Knowledge in Video-Language Foundation Models
Zhenhailong Wang
Ansel Blume
Sha Li
Genglin Liu
Jaemin Cho
Zineng Tang
Mohit Bansal
Heng Ji
KELM
VGen
17
26
0
18 May 2023
TG-VQA: Ternary Game of Video Question Answering
Hao Li
Peng Jin
Ze-Long Cheng
Songyang Zhang
Kai-xiang Chen
Zhennan Wang
Chang-rui Liu
Jie Chen
21
10
0
17 May 2023
Parameter-efficient Tuning of Large-scale Multimodal Foundation Model
Haixin Wang
Xinlong Yang
Jianlong Chang
Di Jin
Jinan Sun
Shikun Zhang
Xiao Luo
Qi Tian
22
22
0
15 May 2023
SViTT: Temporal Learning of Sparse Video-Text Transformers
Yi Li
Kyle Min
Subarna Tripathi
Nuno Vasconcelos
17
12
0
18 Apr 2023
Chain of Thought Prompt Tuning in Vision Language Models
Jiaxin Ge
Hongyin Luo
Siyuan Qian
Yulu Gan
Jie Fu
Shanghang Zhang
VLM
LRM
MLLM
30
27
0
16 Apr 2023
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Jun Chen
Deyao Zhu
Kilichbek Haydarov
Xiang Li
Mohamed Elhoseiny
23
37
0
09 Apr 2023
Procedure-Aware Pretraining for Instructional Video Understanding
Honglu Zhou
Roberto Martín-Martín
Mubbasir Kapadia
Silvio Savarese
Juan Carlos Niebles
23
38
0
31 Mar 2023
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding
Yuanhao Xiong
Long Zhao
Boqing Gong
Ming-Hsuan Yang
Florian Schroff
Ting Liu
Cho-Jui Hsieh
Liangzhe Yuan
VLM
19
0
0
28 Mar 2023
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Kunchang Li
Yali Wang
Yizhuo Li
Yi Wang
Yinan He
Limin Wang
Yu Qiao
VGen
30
154
0
28 Mar 2023
GOAL: A Challenging Knowledge-grounded Video Captioning Benchmark for Real-time Soccer Commentary Generation
Ji Qi
Jifan Yu
Teng Tu
Kunyu Gao
Yifan Xu
...
Juanzi Li
Jie Tang
Weidong Guo
Hui Liu
Yu-Syuan Xu
20
19
0
26 Mar 2023
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models
Dohwan Ko
Joon-Young Choi
Hyeong Kyu Choi
Kyoung-Woon On
Byungseok Roh
Hyunwoo J. Kim
44
18
0
23 Mar 2023
Transformers in Speech Processing: A Survey
S. Latif
Aun Zaidi
Heriberto Cuayáhuitl
Fahad Shamshad
Moazzam Shoukat
Junaid Qadir
35
46
0
21 Mar 2023
3D Concept Learning and Reasoning from Multi-View Images
Yining Hong
Chun-Tse Lin
Yilun Du
Zhenfang Chen
J. Tenenbaum
Chuang Gan
3DV
20
51
0
20 Mar 2023
Improving Music Genre Classification from Multi-Modal Properties of Music and Genre Correlations Perspective
Ganghui Ru
Xulong Zhang
Jianzong Wang
Ning Cheng
Jing Xiao
14
1
0
14 Mar 2023
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
Jiaqi Xu
Bo Liu
Yunkuo Chen
Mengli Cheng
Xing Shi
28
1
0
10 Mar 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TS
VLM
18
220
0
27 Feb 2023
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training
Weihong Zhong
Mao Zheng
Duyu Tang
Xuan Luo
Heng Gong
Xiaocheng Feng
Bing Qin
25
8
0
20 Feb 2023
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
Raghav Goyal
E. Mavroudi
Xitong Yang
Sainbayar Sukhbaatar
Leonid Sigal
Matt Feiszli
Lorenzo Torresani
Du Tran
8
7
0
16 Feb 2023
UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling
Haoyu Lu
Yuqi Huo
Guoxing Yang
Zhiwu Lu
Wei Zhan
M. Tomizuka
Mingyu Ding
25
31
0
13 Feb 2023
Is Multimodal Vision Supervision Beneficial to Language?
Avinash Madasu
Vasudev Lal
29
4
0
10 Feb 2023
Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer
Min Peng
Chongyang Wang
Yu Shi
Xiang-Dong Zhou
ViT
42
7
0
04 Feb 2023
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
Haiyang Xu
Qinghao Ye
Mingshi Yan
Yaya Shi
Jiabo Ye
...
Guohai Xu
Ji Zhang
Songfang Huang
Feiran Huang
Jingren Zhou
MLLM
VLM
MoE
23
158
0
01 Feb 2023
Previous
1
2
3
Next