arXiv: 2206.01720
Revisiting the "Video" in Video-Language Understanding

3 June 2022
S. Buch
Cristobal Eyzaguirre
Adrien Gaidon
Jiajun Wu
L. Fei-Fei
Juan Carlos Niebles

Papers citing "Revisiting the "Video" in Video-Language Understanding"

50 / 122 papers shown
Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering
Haibo Wang
Chenghang Lai
Yixuan Sun
Weifeng Ge
13
5
0
19 Jan 2024
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
Zongxin Yang
Guikun Chen
Xiaodi Li
Wenguan Wang
Yi Yang
LM&Ro
LLMAG
48
35
0
16 Jan 2024
FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos
S. DarshanSingh
Zeeshan Khan
Makarand Tapaswi
VLM
CLIP
26
3
0
15 Jan 2024
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
Aleksandar Stanić
Sergi Caelles
Michael Tschannen
LRM
VLM
23
9
0
03 Jan 2024
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
Ziyi Bai
Ruiping Wang
Xilin Chen
89
8
0
03 Jan 2024
Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering
Haopeng Li
Qiuhong Ke
Mingming Gong
Tom Drummond
27
1
0
03 Jan 2024
Text-Conditioned Resampler For Long Form Video Understanding
Bruno Korbar
Yongqin Xian
A. Tonioni
Andrew Zisserman
Federico Tombari
28
12
0
19 Dec 2023
Appearance-based Refinement for Object-Centric Motion Segmentation
Junyu Xie
Weidi Xie
Andrew Zisserman
VOS
28
3
0
18 Dec 2023
Artificial intelligence optical hardware empowers high-resolution hyperspectral video understanding at 1.2 Tb/s
M. Makarenko
Qizhou Wang
A. Burguete-Lopez
Silvio Giancola
Bernard Ghanem
Luca Passone
A. Fratalocchi
9
1
0
17 Dec 2023
Recursive Visual Programming
Jiaxin Ge
Sanjay Subramanian
Baifeng Shi
Roei Herzig
Trevor Darrell
27
4
0
04 Dec 2023
RTQ: Rethinking Video-language Understanding Based on Image-text Model
Xiao Wang
Yaoyu Li
Tian Gan
Zheng Zhang
Jingjing Lv
Liqiang Nie
11
6
0
01 Dec 2023
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
Artemis Panagopoulou
Le Xue
Ning Yu
Junnan Li
Dongxu Li
Shafiq R. Joty
Ran Xu
Silvio Savarese
Caiming Xiong
Juan Carlos Niebles
VLM
MLLM
33
45
0
30 Nov 2023
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
Shicheng Li
Lei Li
Shuhuai Ren
Yuanxin Liu
Yi Liu
Rundong Gao
Xu Sun
Lu Hou
27
29
0
29 Nov 2023
LEAP: LLM-Generation of Egocentric Action Programs
Eadom Dessalene
Michael Maynord
Cornelia Fermuller
Yiannis Aloimonos
18
3
0
29 Nov 2023
Characterizing Video Question Answering with Sparsified Inputs
Shiyuan Huang
Robinson Piramuthu
Vicente Ordonez
Shih-Fu Chang
Gunnar A. Sigurdsson
13
0
0
27 Nov 2023
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
Ruyang Liu
Jingjia Huang
Wei-Nan Gao
Thomas H. Li
Ge Li
VLM
27
3
0
25 Nov 2023
AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering
Xiuyuan Chen
Yuan Lin
Yuchen Zhang
Weiran Huang
ELM
MLLM
18
26
0
25 Nov 2023
Vamos: Versatile Action Models for Video Understanding
Shijie Wang
Qi Zhao
Minh Quan Do
Nakul Agarwal
Kwonjoon Lee
Chen Sun
27
19
0
22 Nov 2023
VideoCon: Robust Video-Language Alignment via Contrast Captions
Hritik Bansal
Yonatan Bitton
Idan Szpektor
Kai-Wei Chang
Aditya Grover
28
14
0
15 Nov 2023
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
A. Piergiovanni
Isaac Noble
Dahun Kim
Michael S. Ryoo
Victor Gomes
A. Angelova
33
19
0
09 Nov 2023
An Empirical Study of Frame Selection for Text-to-Video Retrieval
Mengxia Wu
Min Cao
Yang Bai
Ziyin Zeng
Chen Chen
Liqiang Nie
Min Zhang
17
3
0
01 Nov 2023
MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks
Allen Nie
Yuhui Zhang
Atharva Amdekar
Chris Piech
Tatsunori Hashimoto
Tobias Gerstenberg
18
33
0
30 Oct 2023
Large Language Models are Temporal and Causal Reasoners for Video Question Answering
Dohwan Ko
Ji Soo Lee
Wooyoung Kang
Byungseok Roh
Hyunwoo J. Kim
LRM
33
31
0
24 Oct 2023
VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools
Huihui Gong
Minjing Dong
Siqi Ma
S. Çamtepe
Chang Xu
Lei Hou
Surya Nepal
VLM
MLLM
45
0
0
16 Oct 2023
Unified Coarse-to-Fine Alignment for Video-Text Retrieval
Ziyang Wang
Yi-Lin Sung
Feng Cheng
Gedas Bertasius
Mohit Bansal
93
44
0
18 Sep 2023
EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding
Yue Xu
Yong-Lu Li
Zhemin Huang
Michael Xu Liu
Cewu Lu
Yu-Wing Tai
Chi-Keung Tang
EgoV
18
9
0
05 Sep 2023
ATM: Action Temporality Modeling for Video Question Answering
Junwen Chen
Jie Zhu
Yu Kong
19
1
0
05 Sep 2023
Can I Trust Your Answer? Visually Grounded Video Question Answering
Junbin Xiao
Angela Yao
Yicong Li
Tat-Seng Chua
28
46
0
04 Sep 2023
Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer
Guangyi Chen
Xiao Liu
Guangrun Wang
Kun Zhang
Philip H.S. Torr
Xiaoping Zhang
Yansong Tang
19
18
0
16 Aug 2023
Redundancy-aware Transformer for Video Question Answering
Yicong Li
Xun Yang
An Zhang
Chun Feng
Xiang Wang
Tat-Seng Chua
12
15
0
07 Aug 2023
Self-Adaptive Sampling for Efficient Video Question-Answering on Image--Text Models
Wei Han
Hui Chen
MingSung Kan
Soujanya Poria
24
1
0
09 Jul 2023
VideoGLUE: Video General Understanding Evaluation of Foundation Models
Liangzhe Yuan
N. B. Gundavarapu
Long Zhao
Hao Zhou
Yin Cui
...
Florian Schroff
Hartwig Adam
Ming Yang
Ting Liu
Boqing Gong
ELM
32
9
0
06 Jul 2023
ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models
Avinash Madasu
Vasudev Lal
CoGe
37
3
0
28 Jun 2023
PTVD: A Large-Scale Plot-Oriented Multimodal Dataset Based on Television Dramas
Chen Li
Xutan Peng
Teng Wang
Yixiao Ge
Mengyang Liu
Xuyuan Xu
Yexin Wang
Ying Shan
VGen
15
2
0
26 Jun 2023
Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion
Isha Rawal
Alexander Matyasko
Shantanu Jaiswal
Basura Fernando
Cheston Tan
16
1
0
15 Jun 2023
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
Difei Gao
Lei Ji
Luowei Zhou
Kevin Lin
Joya Chen
Zihan Fan
Mike Zheng Shou
MLLM
24
71
0
14 Jun 2023
Paxion: Patching Action Knowledge in Video-Language Foundation Models
Zhenhailong Wang
Ansel Blume
Sha Li
Genglin Liu
Jaemin Cho
Zineng Tang
Mohit Bansal
Heng Ji
KELM
VGen
17
26
0
18 May 2023
Self-Chained Image-Language Model for Video Localization and Question Answering
Shoubin Yu
Jaemin Cho
Prateek Yadav
Mohit Bansal
36
129
0
11 May 2023
Visual Causal Scene Refinement for Video Question Answering
Yushen Wei
Yang Liu
Hongfei Yan
Guanbin Li
Liang Lin
CML
12
21
0
07 May 2023
SViTT: Temporal Learning of Sparse Video-Text Transformers
Yi Li
Kyle Min
Subarna Tripathi
Nuno Vasconcelos
17
12
0
18 Apr 2023
VCD: Visual Causality Discovery for Cross-Modal Question Reasoning
Y. Liu
Guanbin Li
Jingzhou Luo
Liang Lin
BDL
LRM
38
5
0
17 Apr 2023
Multimodal Representation Learning of Cardiovascular Magnetic Resonance Imaging
Jielin Qiu
Peide Huang
Makiya Nakashima
Jae-Hyeok Lee
Jiacheng Zhu
...
Byung-Hak Kim
Debbie Kwon
Douglas Weber
Ding Zhao
David Chen
SSL
14
4
0
16 Apr 2023
Verbs in Action: Improving verb understanding in video-language models
Liliane Momeni
Mathilde Caron
Arsha Nagrani
Andrew Zisserman
Cordelia Schmid
30
70
0
13 Apr 2023
GOAL: A Challenging Knowledge-grounded Video Captioning Benchmark for Real-time Soccer Commentary Generation
Ji Qi
Jifan Yu
Teng Tu
Kunyu Gao
Yifan Xu
...
Juanzi Li
Jie Tang
Weidong Guo
Hui Liu
Yu-Syuan Xu
23
19
0
26 Mar 2023
Dual-path Adaptation from Image to Video Transformers
Jungin Park
Jiyoung Lee
K. Sohn
ViT
19
37
0
17 Mar 2023
ViperGPT: Visual Inference via Python Execution for Reasoning
Dídac Surís
Sachit Menon
Carl Vondrick
MLLM
LRM
ReLM
40
429
0
14 Mar 2023
Contrastive Video Question Answering via Video Graph Transformer
Junbin Xiao
Pan Zhou
Angela Yao
Yicong Li
Richang Hong
Shuicheng Yan
Tat-Seng Chua
ViT
19
35
0
27 Feb 2023
Multimodal Subtask Graph Generation from Instructional Videos
Y. Jang
Sungryull Sohn
Lajanugen Logeswaran
Tiange Luo
Moontae Lee
Ho Hin Lee
23
9
0
17 Feb 2023
Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer
Min Peng
Chongyang Wang
Yu Shi
Xiang-Dong Zhou
ViT
42
7
0
04 Feb 2023
Semi-Parametric Video-Grounded Text Generation
Sungdong Kim
Jin-Hwa Kim
Jiyoung Lee
Minjoon Seo
VGen
17
14
0
27 Jan 2023