Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2111.12681
Cited By
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling
24 November 2021
Tsu-jui Fu
Linjie Li
Zhe Gan
Kevin Qinghong Lin
W. Wang
Lijuan Wang
Zicheng Liu
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling"
50 / 169 papers shown
Title
UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework
Chris Kelly
Luhui Hu
Cindy Yang
Yu Tian
Deshun Yang
Bang Yang
Zaoshan Huang
Zihao Li
Yuexian Zou
AI4TS
17
3
0
16 Nov 2023
Multi Sentence Description of Complex Manipulation Action Videos
Fatemeh Ziaeetabar
Reza Safabakhsh
S. Momtazi
M. Tamosiunaite
F. Worgotter
13
1
0
13 Nov 2023
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models
.Ilker Kesen
Andrea Pedrotti
Mustafa Dogan
Michele Cafagna
Emre Can Acikgoz
...
Iacer Calixto
Anette Frank
Albert Gatt
Aykut Erdem
Erkut Erdem
20
15
0
13 Nov 2023
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
A. Piergiovanni
Isaac Noble
Dahun Kim
Michael S. Ryoo
Victor Gomes
A. Angelova
20
19
0
09 Nov 2023
ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos
Te-Lin Wu
Zi-Yi Dou
Qingyuan Hu
Yu Hou
Nischal Reddy Chandra
Marjorie Freedman
R. Weischedel
Nanyun Peng
16
5
0
02 Nov 2023
Long Story Short: a Summarize-then-Search Method for Long Video Question Answering
Jiwan Chung
Youngjae Yu
78
5
0
02 Nov 2023
MM-VID: Advancing Video Understanding with GPT-4V(ision)
Kevin Qinghong Lin
Faisal Ahmed
Linjie Li
Chung-Ching Lin
E. Azarnasab
...
Lin Liang
Zicheng Liu
Yumao Lu
Ce Liu
Lijuan Wang
MLLM
10
62
0
30 Oct 2023
Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li
Kunchang Li
Yinan He
Yi Wang
Yali Wang
Limin Wang
Yu Qiao
Ping Luo
CLIP
VLM
VGen
25
2
0
30 Oct 2023
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
Shuhuai Ren
Sishuo Chen
Shicheng Li
Xu Sun
Lu Hou
ViT
27
28
0
29 Oct 2023
ArchBERT: Bi-Modal Understanding of Neural Architectures and Natural Languages
Mohammad Akbari
Saeed Ranjbar Alvar
Behnam Kamranian
Amin Banitalebi-Dehkordi
Yong Zhang
AI4CE
12
0
0
26 Oct 2023
VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools
Huihui Gong
Minjing Dong
Siqi Ma
S. Çamtepe
Chang Xu
Lei Hou
Surya Nepal
VLM
MLLM
37
0
0
16 Oct 2023
Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks
Avinash Madasu
Anahita Bhiwandiwalla
Vasudev Lal
VLM
21
0
0
07 Oct 2023
Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts
Bipin Rajendran
Bashir M. Al-Hashimi
MLLM
VLM
23
2
0
27 Sep 2023
Distraction-free Embeddings for Robust VQA
Atharvan Dogra
Deeksha Varshney
A. Kalyan
A. Deshpande
Neeraj Kumar
11
0
0
31 Aug 2023
Multi-event Video-Text Retrieval
Gengyuan Zhang
Jisen Ren
Jindong Gu
Volker Tresp
11
7
0
22 Aug 2023
Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models
Dohwan Ko
Ji Soo Lee
M. Choi
Jaewon Chu
Jihwan Park
Hyunwoo J. Kim
12
5
0
18 Aug 2023
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Enxin Song
Wenhao Chai
Guanhong Wang
Yucheng Zhang
Haoyang Zhou
...
Tianbo Ye
Yanting Zhang
Yang Lu
Jenq-Neng Hwang
Gaoang Wang
VLM
MLLM
6
259
0
31 Jul 2023
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor
Corentin Dancette
Alexandre Ramé
Matthieu Cord
MoMe
MLLM
16
42
0
30 Jul 2023
Discovering Spatio-Temporal Rationales for Video Question Answering
Yicong Li
Junbin Xiao
Chun Feng
Xiang Wang
Tat-Seng Chua
11
12
0
22 Jul 2023
Traffic-Domain Video Question Answering with Automatic Captioning
Ehsan Qasemi
Jonathan M Francis
A. Oltramari
16
8
0
18 Jul 2023
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
Yi Wang
Yinan He
Yizhuo Li
Kunchang Li
Jiashuo Yu
...
Ping Luo
Ziwei Liu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
11
241
0
13 Jul 2023
Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning
Gengyuan Zhang
Yurui Zhang
Kerui Zhang
Volker Tresp
LRM
14
10
0
12 Jul 2023
Mx2M: Masked Cross-Modality Modeling in Domain Adaptation for 3D Semantic Segmentation
Boxiang Zhang
Zunran Wang
Yonggen Ling
Yuanyuan Guan
Shenghao Zhang
Wenhui Li
25
5
0
09 Jul 2023
Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models
Wei Han
Hui Chen
MingSung Kan
Soujanya Poria
16
1
0
09 Jul 2023
ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models
Avinash Madasu
Vasudev Lal
CoGe
24
3
0
28 Jun 2023
FunQA: Towards Surprising Video Comprehension
Binzhu Xie
Sicheng Zhang
Zitang Zhou
Bo-wen Li
Yuanhan Zhang
Jack Hessel
Jingkang Yang
Ziwei Liu
23
20
0
26 Jun 2023
AudioPaLM: A Large Language Model That Can Speak and Listen
Paul Kishan Rubenstein
Chulayuth Asawaroengchai
D. Nguyen
Ankur Bapna
Zalan Borsos
...
Neil Zeghidour
Yu Zhang
Zhishuai Zhang
Lukás Zilka
Christian Frank
LM&MA
AuLLM
VLM
30
256
0
22 Jun 2023
Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models
Junting Pan
Ziyi Lin
Yuying Ge
Xiatian Zhu
Renrui Zhang
Yi Wang
Yu Qiao
Hongsheng Li
MLLM
16
15
0
15 Jun 2023
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
Sihan Chen
Xingjian He
Handong Li
Xiaojie Jin
Jiashi Feng
J. Liu
VLM
CLIP
14
5
0
15 Jun 2023
arXiVeri: Automatic table verification with GPT
Gyungin Shin
Weidi Xie
Samuel Albanie
LMTD
20
1
0
13 Jun 2023
Global and Local Semantic Completion Learning for Vision-Language Pre-training
Rong-Cheng Tu
Yatai Ji
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
10
1
0
12 Jun 2023
A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks
Saidul Islam
Hanae Elmekki
Ahmed Elsebai
Jamal Bentahar
Najat Drawel
Gaith Rjoub
Witold Pedrycz
ViT
MedIm
6
167
0
11 Jun 2023
MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning
Jianghui Wang
Yuxuan Wang
Dongyan Zhao
Zilong Zheng
35
0
0
04 Jun 2023
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Sihan Chen
Handong Li
Qunbo Wang
Zijia Zhao
Ming-Ting Sun
Xinxin Zhu
J. Liu
21
95
0
29 May 2023
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models
Weixi Feng
Wanrong Zhu
Tsu-jui Fu
Varun Jampani
Arjun Reddy Akula
Xuehai He
Sugato Basu
X. Wang
William Yang Wang
MLLM
18
137
0
24 May 2023
R2H: Building Multimodal Navigation Helpers that Respond to Help Requests
Yue Fan
Jing Gu
Kaizhi Zheng
Xin Eric Wang
14
3
0
23 May 2023
TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale
Ziyun Zeng
Yixiao Ge
Zhan Tong
Xihui Liu
Shutao Xia
Ying Shan
16
9
0
23 May 2023
Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought
Vaishnavi Himakunthala
Andy Ouyang
Daniel Philip Rose
Ryan He
Alex Mei
Yujie Lu
Chinmay Sonar
Michael Stephen Saxon
William Yang Wang
MLLM
LRM
23
2
0
23 May 2023
EDIS: Entity-Driven Image Search over Multimodal Web Content
Siqi Liu
Weixi Feng
Tsu-jui Fu
Wenhu Chen
W. Wang
VLM
28
9
0
23 May 2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
Xingjian He
Sihan Chen
Fan Ma
Zhicheng Huang
Xiaojie Jin
Zikang Liu
Dongmei Fu
Yi Yang
J. Liu
Jiashi Feng
VLM
CLIP
13
17
0
22 May 2023
Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model
Siyuan Huang
Zhengkai Jiang
Hao Dong
Yu Qiao
Peng Gao
Hongsheng Li
LM&Ro
14
91
0
18 May 2023
Is a Video worth
n
×
n
n\times n
n
×
n
Images? A Highly Efficient Approach to Transformer-based Video Question Answering
Chenyang Lyu
Tianbo Ji
Yvette Graham
Jennifer Foster
ViT
11
0
0
16 May 2023
Parameter-efficient Tuning of Large-scale Multimodal Foundation Model
Haixin Wang
Xinlong Yang
Jianlong Chang
Di Jin
Jinan Sun
Shikun Zhang
Xiao Luo
Qi Tian
14
18
0
15 May 2023
Self-Chained Image-Language Model for Video Localization and Question Answering
Shoubin Yu
Jaemin Cho
Prateek Yadav
Mohit Bansal
22
129
0
11 May 2023
VideoChat: Chat-Centric Video Understanding
Kunchang Li
Yinan He
Yi Wang
Yizhuo Li
Wen Wang
Ping Luo
Yali Wang
Limin Wang
Yu Qiao
MLLM
9
287
0
10 May 2023
VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation
Xilun Chen
L. Yu
Wenhan Xiong
Barlas Ouguz
Yashar Mehdad
Wen-tau Yih
VGen
24
3
0
04 May 2023
ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos
Zhou Yu
Lixiang Zheng
Zhou Zhao
A. Fedoseev
Jianping Fan
Kui Ren
Jun Yu
CoGe
22
5
0
04 May 2023
Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime
Chuhan Zhang
Antoine Miech
Jiajun Shen
Jean-Baptiste Alayrac
Pauline Luc
VLM
VPVLM
28
2
0
03 May 2023
In-Context Learning Unlocked for Diffusion Models
Zhendong Wang
Yifan Jiang
Yadong Lu
Yelong Shen
Pengcheng He
Weizhu Chen
Zhangyang Wang
Mingyuan Zhou
VLM
DiffM
86
68
0
01 May 2023
SViTT: Temporal Learning of Sparse Video-Text Transformers
Yi Li
Kyle Min
Subarna Tripathi
Nuno Vasconcelos
12
12
0
18 Apr 2023
Previous
1
2
3
4
Next