1904.01766
Cited By
v1
v2 (latest)
VideoBERT: A Joint Model for Video and Language Representation Learning
3 April 2019
Chen Sun
Austin Myers
Carl Vondrick
Kevin Patrick Murphy
Cordelia Schmid
VLM
SSL
Papers citing
"VideoBERT: A Joint Model for Video and Language Representation Learning"
50 / 803 papers shown
MGeo: Multi-Modal Geographic Pre-Training Method
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2023
Ruixue Ding
Boli Chen
Pengjun Xie
Fei Huang
Xin Li
Qiang-Wei Zhang
Yao Xu
273
29
0
11 Jan 2023
Universal Multimodal Representation for Language Understanding
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Zhuosheng Zhang
Kehai Chen
Rui Wang
Masao Utiyama
Eiichiro Sumita
Z. Li
Hai Zhao
SSL
291
30
0
09 Jan 2023
MAQA: A Multimodal QA Benchmark for Negation
Judith Yue Li
Aren Jansen
Qingqing Huang
Joonseok Lee
Ravi Ganti
Dima Kuzmin
216
7
0
09 Jan 2023
Logically at Factify 2: A Multi-Modal Fact Checking System Based on Evidence Retrieval techniques and Transformer Encoder Architecture
P. Verschuuren
Jie Gao
A. V. Eeden
Stylianos Oikonomou
Anil Bandhakavi
269
2
0
09 Jan 2023
Test of Time: Instilling Video-Language Models with a Sense of Time
Computer Vision and Pattern Recognition (CVPR), 2023
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
465
47
0
05 Jan 2023
Learning Trajectory-Word Alignments for Video-Language Tasks
IEEE International Conference on Computer Vision (ICCV), 2023
Xu Yang
Zhang Li
Haiyang Xu
Hanwang Zhang
Qinghao Ye
Chenliang Li
Ming Yan
Yu Zhang
Fei Huang
Songfang Huang
215
7
0
05 Jan 2023
NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory
Computer Vision and Pattern Recognition (CVPR), 2023
Santhosh Kumar Ramakrishnan
Ziad Al-Halah
Kristen Grauman
372
47
0
02 Jan 2023
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training
IEEE International Conference on Computer Vision (ICCV), 2022
Qinghao Ye
Guohai Xu
Ming Yan
Haiyang Xu
Qi Qian
Ji Zhang
Fei Huang
VLM
AI4TS
547
91
0
30 Dec 2022
Emotion Recognition with Pre-Trained Transformers Using Multimodal Signals
Affective Computing and Intelligent Interaction (ACII), 2022
Juan Vazquez-Rodriguez
G. Lefebvre
Julien Cumin
James L. Crowley
195
15
0
22 Dec 2022
VindLU: A Recipe for Effective Video-and-Language Pretraining
Computer Vision and Pattern Recognition (CVPR), 2022
Feng Cheng
Xizi Wang
Jie Lei
David J. Crandall
Joey Tianyi Zhou
Gedas Bertasius
VLM
290
92
0
09 Dec 2022
Tencent AVS: A Holistic Ads Video Dataset for Multi-modal Scene Segmentation
IEEE Access (IEEE Access), 2022
Jie Jiang
Zhimin Li
Jiangfeng Xiong
Rongwei Quan
Qinglin Lu
Wei Liu
199
3
0
09 Dec 2022
Learning Video Representations from Large Language Models
Computer Vision and Pattern Recognition (CVPR), 2022
Yue Zhao
Ishan Misra
Philipp Krahenbuhl
Rohit Girdhar
VLM
AI4TS
307
231
0
08 Dec 2022
SimVTP: Simple Video Text Pre-training with Masked Autoencoders
Yue Ma
Tianyu Yang
Yin Shan
Xiu Li
169
30
0
07 Dec 2022
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang
Kunchang Li
Yizhuo Li
Yinan He
Bingkun Huang
...
Junting Pan
Jiashuo Yu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
466
448
0
06 Dec 2022
Muscles in Action
IEEE International Conference on Computer Vision (ICCV), 2022
Mia Chiquier
Carl Vondrick
319
1
0
05 Dec 2022
Day2Dark: Pseudo-Supervised Activity Recognition beyond Silent Daylight
International Journal of Computer Vision (IJCV), 2022
Yunhua Zhang
Hazel Doughty
Cees G. M. Snoek
VLM
303
2
0
05 Dec 2022
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval
Fangxun Shu
Biaolong Chen
Yue Liao
Shuwen Xiao
Wenyu Sun
Xiaobo Li
Yousong Zhu
Jinqiao Wang
Si Liu
CLIP
190
13
0
02 Dec 2022
Protein Language Models and Structure Prediction: Connection and Progression
Bozhen Hu
Jun Xia
Jiangbin Zheng
Cheng Tan
Yufei Huang
Yongjie Xu
Stan Z. Li
220
46
0
30 Nov 2022
SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation
International Conference on Machine Learning (ICML), 2022
Huaishao Luo
Junwei Bao
Youzheng Wu
Xiaodong He
Tianrui Li
VLM
248
196
0
27 Nov 2022
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022
Zineng Tang
Jaemin Cho
Jie Lei
Joey Tianyi Zhou
VLM
179
10
0
21 Nov 2022
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
Neural Information Processing Systems (NeurIPS), 2022
Peng Jin
Jinfa Huang
Fenglin Liu
Xian Wu
Shen Ge
Guoli Song
David Clifton
Jing Chen
VLM
307
87
0
21 Nov 2022
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information
Computer Vision and Pattern Recognition (CVPR), 2022
Weijie Su
Xizhou Zhu
Chenxin Tao
Lewei Lu
Bin Li
Gao Huang
Yu Qiao
Xiaogang Wang
Jie Zhou
Jifeng Dai
245
56
0
17 Nov 2022
Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
Neural Information Processing Systems (NeurIPS), 2022
Shizhe Chen
Pierre-Louis Guhur
Makarand Tapaswi
Cordelia Schmid
Ivan Laptev
266
128
0
17 Nov 2022
Cross-Modal Adapter for Vision-Language Retrieval
Pattern Recognition (Pattern Recogn.), 2022
Haojun Jiang
Jianke Zhang
Rui Huang
Chunjiang Ge
Zanlin Ni
Jiwen Lu
Gao Huang
368
43
0
17 Nov 2022
Pragmatics in Language Grounding: Phenomena, Tasks, and Modeling Approaches
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Daniel Fried
Nicholas Tomlin
Jennifer Hu
Roma Patel
Aida Nematzadeh
250
9
0
15 Nov 2022
Grafting Pre-trained Models for Multimodal Headline Generation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Lingfeng Qiao
Chen Wu
Ye Liu
Haoyuan Peng
Di Yin
Bo Ren
248
6
0
14 Nov 2022
CLOP: Video-and-Language Pre-Training with Knowledge Regularizations
ACM Multimedia (ACM MM), 2022
Guohao Li
Hu Yang
Feng He
Zhifan Feng
Yajuan Lyu
Hua Wu
Haifeng Wang
VLM
178
2
0
07 Nov 2022
CASA: Category-agnostic Skeletal Animal Reconstruction
Neural Information Processing Systems (NeurIPS), 2022
Yuefan Wu
Ze-Yin Chen
Shao-Wei Liu
Zhongzheng Ren
Shenlong Wang
261
41
0
04 Nov 2022
Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
Neural Information Processing Systems (NeurIPS), 2022
Junru Wu
Yi Liang
Feng Han
Hassan Akbari
Zinan Lin
Cong Yu
153
14
0
03 Nov 2022
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
ACM Transactions on Knowledge Discovery from Data (TKDD), 2021
Fenglin Liu
Xian Wu
Shen Ge
Xuancheng Ren
Wei Fan
Xu Sun
Yuexian Zou
VLM
209
13
0
28 Oct 2022
End-to-End Multimodal Representation Learning for Video Dialog
Huda AlAmri
Anthony Bilic
Michael Hu
Apoorva Beedu
Irfan Essa
213
7
0
26 Oct 2022
Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yuechen Wang
Wen-gang Zhou
Houqiang Li
AI4TS
155
14
0
21 Oct 2022
LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Dongsheng Chen
Chaofan Tao
Lu Hou
Lifeng Shang
Xin Jiang
Qun Liu
VLM
251
19
0
21 Oct 2022
H4VDM: H.264 Video Device Matching
Ziyue Xiang
Paolo Bestagini
Stefano Tubaro
Edward J. Delp
116
1
0
20 Oct 2022
Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yu Zhao
Jianguo Wei
Zhichao Lin
Yueheng Sun
Meishan Zhang
Hao Fei
193
17
0
20 Oct 2022
Grounded Video Situation Recognition
Neural Information Processing Systems (NeurIPS), 2022
Zeeshan Khan
C. V. Jawahar
Makarand Tapaswi
192
16
0
19 Oct 2022
VTC: Improving Video-Text Retrieval with User Comments
European Conference on Computer Vision (ECCV), 2022
Laura Hanu
James Thewlis
Yuki M. Asano
Christian Rupprecht
VGen
245
8
0
19 Oct 2022
Video in 10 Bits: Few-Bit VideoQA for Efficiency and Privacy
Shiyuan Huang
Robinson Piramuthu
Shih-Fu Chang
Gunnar Sigurdsson
202
1
0
15 Oct 2022
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2022
Wenliang Dai
Zihan Liu
Ziwei Ji
Jane Polak Scowcroft
Pascale Fung
MLLM
VLM
313
76
0
14 Oct 2022
Can Language Representation Models Think in Bets?
Royal Society Open Science (RSOS), 2022
Zhi-Bin Tang
Mayank Kejriwal
159
7
0
14 Oct 2022
RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Xing Wu
Chaochen Gao
Zijia Lin
Zhongyuan Wang
Jizhong Han
Songlin Hu
162
10
0
13 Oct 2022
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning
Neural Information Processing Systems (NeurIPS), 2022
Yuchong Sun
Hongwei Xue
Ruihua Song
Bei Liu
Huan Yang
Jianlong Fu
AI4TS
VLM
297
84
0
12 Oct 2022
Contrastive Video-Language Learning with Fine-grained Frame Sampling
Zixu Wang
Yujie Zhong
Yishu Miao
Lin Ma
Lucia Specia
233
15
0
10 Oct 2022
Generating Executable Action Plans with Environmentally-Aware Language Models
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022
Maitrey Gramopadhye
D. Szafir
LM&Ro
LLMAG
325
38
0
10 Oct 2022
ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval
Asian Conference on Computer Vision (ACCV), 2022
A. Fragomeni
Michael Wray
Dima Damen
CLIP
ViT
158
4
0
09 Oct 2022
Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling
British Machine Vision Conference (BMVC), 2022
Hsin-Ying Lee
Hung-Ting Su
Bing-Chen Tsai
Tsung-Han Wu
Jia-Fong Yeh
Winston H. Hsu
312
2
0
08 Oct 2022
See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction
Maria Attarian
Advaya Gupta
Ziyi Zhou
Wei Yu
Igor Gilitschenski
Animesh Garg
LM&Ro
213
8
0
07 Oct 2022
Visualize Before You Write: Imagination-Guided Open-Ended Text Generation
Findings (Findings), 2022
Wanrong Zhu
An Yan
Yujie Lu
Wenda Xu
Xinze Wang
Miguel P. Eckstein
William Yang Wang
324
38
0
07 Oct 2022
Understanding Prior Bias and Choice Paralysis in Transformer-based Language Representation Models through Four Experimental Probes
Ke Shen
Mayank Kejriwal
192
4
0
03 Oct 2022
CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training
IEEE International Conference on Computer Vision (ICCV), 2022
Tianyu Huang
Bowen Dong
Yunhan Yang
Xiaoshui Huang
Rynson W. H. Lau
Wanli Ouyang
W. Zuo
VLM
3DPC
CLIP
580
199
0
03 Oct 2022