2111.12681
Cited By
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling
24 November 2021
Tsu-jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
W. Wang
Lijuan Wang
Zicheng Liu
VLM
Papers citing
"VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling"
50 / 169 papers shown
Title
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
Dahun Kim
A. Piergiovanni
Ganesh Mallya
A. Angelova
CoGe
34
0
0
04 Apr 2025
Quantum EigenGame for excited state calculation
David Quiroga
Jason Han
Anastasios Kyrillidis
48
0
0
17 Mar 2025
Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment
Xiaowei Bi
Zheyuan Xu
47
1
0
12 Mar 2025
EgoLife: Towards Egocentric Life Assistant
Jingkang Yang
Shuai Liu
Hongming Guo
Yuhao Dong
X. Zhang
...
Joerg Widmer
Francesco Gringoli
Lei Yang
Bo Li
Z. Liu
EgoV
49
2
0
05 Mar 2025
Graph Perceiver IO: A General Architecture for Graph Structured Data
Seyun Bae
Hoyoon Byun
Changdae Oh
Yoon-Sik Cho
Kyungwoo Song
GNN
82
2
0
24 Feb 2025
GFG -- Gender-Fair Generation: A CALAMITA Challenge
Simona Frenda
Andrea Piergentili
Beatrice Savoldi
Marco Madeddu
Martina Rosola
Silvia Casola
Chiara Ferrando
V. Patti
Matteo Negri
L. Bentivogli
30
2
0
31 Dec 2024
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
Pinelopi Papalampidi
Skanda Koppula
Shreya Pathak
Justin T Chiu
Joseph Heyward
Viorica Patraucean
Jiajun Shen
Antoine Miech
Andrew Zisserman
Aida Nematzadeh
VLM
54
23
0
31 Dec 2024
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding
Yueqian Wang
Xiaojun Meng
Y. Wang
Jianxin Liang
Qun Liu
Dongyan Zhao
29
0
0
23 Dec 2024
GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning
Y. Wang
Zhikang Zhang
Jue Wang
D. Fan
Zhenlin Xu
Linda Liu
Xiang Hao
Vimal Bhat
Xinyu Li
VLM
69
1
0
10 Dec 2024
TokenBinder: Text-Video Retrieval with One-to-Many Alignment Paradigm
Bingqing Zhang
Zhuo Cao
Heming Du
Xin Yu
Xue Li
Jiajun Liu
Sen Wang
VGen
16
0
0
30 Sep 2024
VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery
Mohammadmahdi Honarmand
Muhammad Abdullah Jamal
Omid Mohareri
52
1
0
07 Sep 2024
Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models
Jean Park
Kuk Jin Jang
Basam Alasaly
Sriharsha Mopidevi
Andrew Zolensky
Eric Eaton
Insup Lee
Kevin Johnson
20
4
0
22 Aug 2024
Macformer: Transformer with Random Maclaurin Feature Attention
Yuhan Guo
Lizhong Ding
Ye Yuan
Guoren Wang
35
0
0
21 Aug 2024
NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality
Chaofan Tao
Gukyeong Kwon
Varad Gunjal
Hao Yang
Zhaowei Cai
Yonatan Dukler
Ashwin Swaminathan
R. Manmatha
Colin Jon Taylor
Stefano Soatto
CoGe
24
0
0
18 Aug 2024
VideoQA in the Era of LLMs: An Empirical Study
Junbin Xiao
Nanxin Huang
Hangyu Qin
Dongyang Li
Yicong Li
...
Zhulin Tao
Jianxing Yu
Liang Lin
Tat-Seng Chua
Angela Yao
21
9
0
08 Aug 2024
Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
Thong Nguyen
Yi Bin
Xiaobao Wu
Xinshuai Dong
Zhiyuan Hu
Khoi M. Le
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
24
5
0
04 Jul 2024
Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering
Zhaohe Liao
Jiangtong Li
Li Niu
Liqing Zhang
CoGe
24
3
0
03 Jul 2024
Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
Hao Fei
Shengqiong Wu
Meishan Zhang
M. Zhang
Tat-Seng Chua
Shuicheng Yan
AI4TS
26
37
0
27 Jun 2024
Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
Thong Nguyen
Yi Bin
Junbin Xiao
Leigang Qu
Yicong Li
Jay Zhangjie Wu
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
VLM
31
9
1
09 Jun 2024
FILS: Self-Supervised Video Feature Prediction In Semantic Language Space
Mona Ahmadian
Frank Guerin
Andrew Gilbert
31
1
0
05 Jun 2024
VQA Training Sets are Self-play Environments for Generating Few-shot Pools
Tautvydas Misiunas
Hassan Mansoor
Jasper Uijlings
Oriana Riva
Victor Carbune
LRM
VLM
25
0
0
30 May 2024
Efficient Vision-Language Pre-training by Cluster Masking
Zihao Wei
Zixuan Pan
Andrew Owens
VLM
21
6
0
14 May 2024
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
Juhong Min
Shyamal Buch
Arsha Nagrani
Minsu Cho
Cordelia Schmid
LRM
34
20
0
09 Apr 2024
LongVLM: Efficient Long Video Understanding via Large Language Models
Yuetian Weng
Mingfei Han
Haoyu He
Xiaojun Chang
Bohan Zhuang
VLM
45
56
0
04 Apr 2024
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
Yi Wang
Kunchang Li
Xinhao Li
Jiashuo Yu
Yinan He
...
Hongjie Zhang
Yifei Huang
Yu Qiao
Yali Wang
Limin Wang
19
104
0
22 Mar 2024
VidLA: Video-Language Alignment at Scale
Mamshad Nayeem Rizve
Fan Fei
Jayakrishnan Unnikrishnan
Son Tran
Benjamin Z. Yao
Belinda Zeng
Mubarak Shah
Trishul M. Chilimbi
VLM
AI4TS
34
4
0
21 Mar 2024
Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels
Tianming Liang
Chaolei Tan
Beihao Xia
Wei-Shi Zheng
Jianfang Hu
22
1
0
21 Mar 2024
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
Ahmad A Mahmood
Ashmal Vayani
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
LRM
47
7
0
21 Mar 2024
vid-TLDR: Training Free Token merging for Light-weight Video Transformer
Joonmyung Choi
Sanghyeok Lee
Jaewon Chu
Minhyuk Choi
Hyunwoo J. Kim
MoMe
ViT
37
3
0
20 Mar 2024
VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework
Chris Kelly
Luhui Hu
Bang Yang
Yu Tian
Deshun Yang
Cindy Yang
Zaoshan Huang
Zihao Li
Jiayin Hu
Yuexian Zou
31
9
0
14 Mar 2024
DAM: Dynamic Adapter Merging for Continual Video QA Learning
Feng Cheng
Ziyang Wang
Yi-Lin Sung
Yan-Bo Lin
Mohit Bansal
Gedas Bertasius
CLL
MoMe
20
10
0
13 Mar 2024
VideoMamba: State Space Model for Efficient Video Understanding
Kunchang Li
Xinhao Li
Yi Wang
Yinan He
Yali Wang
Limin Wang
Yu Qiao
Mamba
30
174
0
11 Mar 2024
LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding
Yuxuan Wang
Yueqian Wang
Pengfei Wu
Jianxin Liang
Dongyan Zhao
Zilong Zheng
VLM
21
3
0
25 Feb 2024
Event-aware Video Corpus Moment Retrieval
Danyang Hou
Liang Pang
Huawei Shen
Xueqi Cheng
20
1
0
21 Feb 2024
VideoPrism: A Foundational Visual Encoder for Video Understanding
Long Zhao
N. B. Gundavarapu
Liangzhe Yuan
Hao Zhou
Shen Yan
...
Huisheng Wang
Hartwig Adam
Mikhail Sirotenko
Ting Liu
Boqing Gong
VGen
27
29
0
20 Feb 2024
LVCHAT: Facilitating Long Video Comprehension
Yu-Xiang Wang
Zeyuan Zhang
Julian McAuley
Zexue He
VLM
20
2
0
19 Feb 2024
SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks
Xingning Dong
Qingpei Guo
Tian Gan
Qing Wang
Jianlong Wu
Xiangyuan Ren
Yuan-Chia Cheng
Wei Chu
13
5
0
31 Jan 2024
STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering
Yueqian Wang
Yuxuan Wang
Kai Chen
Dongyan Zhao
22
2
0
08 Jan 2024
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
Ziyi Bai
Ruiping Wang
Xilin Chen
85
8
0
03 Jan 2024
Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering
Haopeng Li
Qiuhong Ke
Mingming Gong
Tom Drummond
21
1
0
03 Jan 2024
Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports
Haopeng Li
Andong Deng
Qiuhong Ke
Jun Liu
Hossein Rahmani
Yulan Guo
Mohammed Bennamoun
Chen Chen
31
17
0
03 Jan 2024
A Simple LLM Framework for Long-Range Video Question-Answering
Ce Zhang
Taixi Lu
Md. Mohaiminul Islam
Ziyang Wang
Shoubin Yu
Mohit Bansal
Gedas Bertasius
100
80
0
28 Dec 2023
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLM
MLLM
17
36
0
11 Dec 2023
Zero-Shot Video Question Answering with Procedural Programs
Rohan Choudhury
Koichiro Niinuma
Kris M. Kitani
László A. Jeni
13
21
0
01 Dec 2023
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Rohan Myer Krishnan
Zitian Tang
Zhiqiu Yu
Chen Sun
30
1
0
30 Nov 2023
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
Shicheng Li
Lei Li
Shuhuai Ren
Yuanxin Liu
Yi Liu
Rundong Gao
Xu Sun
Lu Hou
21
28
0
29 Nov 2023
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer
Jacob Zhiyuan Fang
Skyler Zheng
Vasu Sharma
Robinson Piramuthu
VLM
27
0
0
28 Nov 2023
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Kunchang Li
Yali Wang
Yinan He
Yizhuo Li
Yi Wang
...
Jilan Xu
Guo Chen
Ping Luo
Limin Wang
Yu Qiao
VLM
MLLM
22
398
0
28 Nov 2023
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
Ruyang Liu
Jingjia Huang
Wei-Nan Gao
Thomas H. Li
Ge Li
VLM
24
3
0
25 Nov 2023
Vamos: Versatile Action Models for Video Understanding
Shijie Wang
Qi Zhao
Minh Quan Do
Nakul Agarwal
Kwonjoon Lee
Chen Sun
21
19
0
22 Nov 2023