Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2212.09522
Cited By
MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering
19 December 2022
Difei Gao
Luowei Zhou
Lei Ji
Linchao Zhu
Yezhou Yang
Mike Zheng Shou
Re-assign community
ArXiv
PDF
HTML
Papers citing
"MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering"
16 / 16 papers shown
Title
DyGEnc: Encoding a Sequence of Textual Scene Graphs to Reason and Answer Questions in Dynamic Scenes
S. Linok
Vadim Semenov
Anastasia Trunova
Oleg Bulichev
Dmitry A. Yudin
40
0
0
06 May 2025
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Shehreen Azad
Vibhav Vineet
Y. S. Rawat
VLM
58
1
0
11 Mar 2025
Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios
Shantanu Jaiswal
Debaditya Roy
Basura Fernando
Cheston Tan
ReLM
LRM
66
2
0
20 Nov 2024
Scene-Text Grounding for Text-Based Video Question Answering
Sheng Zhou
Junbin Xiao
Xun Yang
Peipei Song
Dan Guo
Angela Yao
Meng Wang
Tat-Seng Chua
52
1
0
22 Sep 2024
Learning Video Context as Interleaved Multimodal Sequences
S. Shao
Pengchuan Zhang
Y. Li
Xide Xia
A. Meso
Ziteng Gao
Jinheng Xie
N. Holliman
Mike Zheng Shou
41
5
0
31 Jul 2024
Koala: Key frame-conditioned long video-LLM
Reuben Tan
Ximeng Sun
Ping Hu
Jui-hsien Wang
Hanieh Deilamsalehy
Bryan A. Plummer
Bryan C. Russell
Kate Saenko
38
35
0
05 Apr 2024
VideoDistill: Language-aware Vision Distillation for Video Question Answering
Bo Zou
Chao Yang
Yu Qiao
Chengbin Quan
Youjian Zhao
VGen
33
1
0
01 Apr 2024
Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering
Haibo Wang
Chenghang Lai
Yixuan Sun
Weifeng Ge
13
5
0
19 Jan 2024
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
Zongxin Yang
Guikun Chen
Xiaodi Li
Wenguan Wang
Yi Yang
LM&Ro
LLMAG
41
35
0
16 Jan 2024
Self-Chained Image-Language Model for Video Localization and Question Answering
Shoubin Yu
Jaemin Cho
Prateek Yadav
Mohit Bansal
31
129
0
11 May 2023
Video Graph Transformer for Video Question Answering
Junbin Xiao
Pan Zhou
Tat-Seng Chua
Shuicheng Yan
ViT
134
73
0
12 Jul 2022
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Kristen Grauman
Andrew Westbury
Eugene Byrne
Zachary Chavis
Antonino Furnari
...
Mike Zheng Shou
Antonio Torralba
Lorenzo Torresani
Mingfei Yan
Jitendra Malik
EgoV
218
1,017
0
13 Oct 2021
Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering
Jungin Park
Jiyoung Lee
K. Sohn
123
99
0
29 Apr 2021
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
Xiuye Gu
Tsung-Yi Lin
Weicheng Kuo
Yin Cui
VLM
ObjD
223
897
0
28 Apr 2021
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Huaishao Luo
Lei Ji
Ming Zhong
Yang Chen
Wen Lei
Nan Duan
Tianrui Li
CLIP
VLM
303
771
0
18 Apr 2021
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
278
1,939
0
09 Feb 2021
1