Leveraging Video Descriptions to Learn Video Question Answering

12 November 2016
Kuo-Hao Zeng
Tseng-Hung Chen
Ching-Yao Chuang
Yuan-Hong Liao
Juan Carlos Niebles
Min Sun

Papers citing "Leveraging Video Descriptions to Learn Video Question Answering"

50 / 84 papers shown
TextVidBench: A Benchmark for Long Video Scene Text Understanding
Yangyang Zhong
Ji Qi
Yuan Yao
Pengxin Luo
Yunfeng Yan
Donglian Qi
Zhiyuan Liu
Tat-Seng Chua
346
0
0
05 Jun 2025
An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes
Ji Qi
Yuan Yao
Yushi Bai
Bin Xu
Juanzi Li
Zhiyuan Liu
Tat-Seng Chua
313
5
0
21 Apr 2025
Natural Language Generation from Visual Events: State-of-the-Art and Key Open Questions
Aditya K Surikuchi
Raquel Fernández
Sandro Pezzelle
EGVM
1.1K
0
0
18 Feb 2025
Progress-Aware Video Frame Captioning
Computer Vision and Pattern Recognition (CVPR), 2024
Zihui Xue
Joungbin An
Xitong Yang
Kristen Grauman
687
7
0
03 Dec 2024
Grounded Video Caption Generation
Evangelos Kazakos
Cordelia Schmid
Josef Sivic
296
0
0
12 Nov 2024
Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering
Ting Yu
Kunhao Fu
Shuhui Wang
Qingming Huang
Jun Yu
328
10
0
12 Oct 2024
Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering
IEEE Transactions on Image Processing (TIP), 2024
Ting Yu
Kunhao Fu
Jian Zhang
Qingming Huang
Jun Yu
266
10
0
12 Oct 2024
Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies
Hung-Ting Su
Chun-Tong Chao
Ya-Ching Hsu
Xudong Lin
Yulei Niu
Hung-Yi Lee
Winston H. Hsu
LRM
250
1
0
16 Jun 2024
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA
Jongwoo Park
Kanchana Ranasinghe
Kumara Kahatapitiya
Wonjeong Ryoo
Donghyun Kim
Michael S. Ryoo
439
66
0
13 Jun 2024
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Xuehai He
Weixi Feng
Kaizhi Zheng
Yujie Lu
Wanrong Zhu
...
Zhengyuan Yang
Kevin Lin
William Yang Wang
Lijuan Wang
Xin Eric Wang
VGen
LRM
792
37
0
12 Jun 2024
Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Thong Nguyen
Yi Bin
Junbin Xiao
Leigang Qu
Yicong Li
Jay Zhangjie Wu
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
VLM
633
34
1
09 Jun 2024
CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes
Paritosh Parmar
Eric Peh
Ruirui Chen
Ting En Lam
Yuhan Chen
Elston Tan
Basura Fernando
CML
382
13
0
01 Apr 2024
Cross-Modal Reasoning with Event Correlation for Video Question Answering
Chengxiang Yin
Zhengping Che
Kun Wu
Zhiyuan Xu
Qinru Qiu
Jian Tang
210
0
0
20 Dec 2023
Long Story Short: a Summarize-then-Search Method for Long Video Question Answering
Jiwan Chung
Youngjae Yu
451
7
0
02 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Information Fusion (Inf. Fusion), 2023
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
429
79
0
01 Nov 2023
Learning to Summarize and Answer Questions about a Virtual Robot's Past Actions
Autonomous Robots (Auton. Robots), 2023
Chad DeChant
Iretiayo Akinola
Daniel Bauer
240
13
0
16 Jun 2023
Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Vaishnavi Himakunthala
Andy Ouyang
Daniel Philip Rose
Ryan He
Alex Mei
Yujie Lu
Chinmay Sonar
Michael Stephen Saxon
William Y. Wang
MLLM
LRM
348
2
0
23 May 2023
VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation
Xilun Chen
L. Yu
Wenhan Xiong
Barlas Oğuz
Yashar Mehdad
Anuj Kumar
VGen
196
4
0
04 May 2023
ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos
Computer Vision and Pattern Recognition (CVPR), 2023
Zhou Yu
Lixiang Zheng
Zhou Zhao
A. Fedoseev
Jianping Fan
Kui Ren
Jun Yu
CoGe
361
23
0
04 May 2023
A Review of Deep Learning for Video Captioning
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Moloud Abdar
Meenakshi Kollati
Swaraja Kuraparthi
Farhad Pourpanah
Daniel J. McDuff
...
Shuicheng Yan
Abduallah A. Mohamed
Abbas Khosravi
Xiaoshi Zhong
Fatih Porikli
3DV
250
46
0
22 Apr 2023
Learning Situation Hyper-Graphs for Video Question Answering
Computer Vision and Pattern Recognition (CVPR), 2023
Aisha Urooj Khan
Hilde Kuehne
Bo Wu
Kim Chheu
Walid Bousselham
Chuang Gan
N. Lobo
M. Shah
272
23
0
18 Apr 2023
Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering
Hung-Ting Su
Yulei Niu
Xudong Lin
Winston H. Hsu
Shih-Fu Chang
VGen
ELM
319
13
0
07 Apr 2023
Connecting Vision and Language with Video Localized Narratives
Computer Vision and Pattern Recognition (CVPR), 2023
P. Voigtlaender
Soravit Changpinyo
Jordi Pont-Tuset
Radu Soricut
V. Ferrari
VGen
397
31
0
22 Feb 2023
Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation
Computer Vision and Pattern Recognition (CVPR), 2023
Razvan-George Pasca
Alexey Gavryushin
Muhammad Hamza
Yen-Ling Kuo
Kaichun Mo
Luc Van Gool
Otmar Hilliges
Xi Wang
572
23
0
22 Jan 2023
Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling
British Machine Vision Conference (BMVC), 2022
Hsin-Ying Lee
Hung-Ting Su
Bing-Chen Tsai
Tsung-Han Wu
Jia-Fong Yeh
Winston H. Hsu
372
2
0
08 Oct 2022
EgoTaskQA: Understanding Human Tasks in Egocentric Videos
Neural Information Processing Systems (NeurIPS), 2022
Baoxiong Jia
Ting Lei
Song-Chun Zhu
Siyuan Huang
EgoV
274
107
0
08 Oct 2022
M^4I: Multi-modal Models Membership Inference
Neural Information Processing Systems (NeurIPS), 2022
Pingyi Hu
Zihan Wang
Ruoxi Sun
Hu Wang
Minhui Xue
241
38
0
15 Sep 2022
WildQA: In-the-Wild Video Question Answering
International Conference on Computational Linguistics (COLING), 2022
Santiago Castro
Naihao Deng
Pingxuan Huang
Mihai Burzo
Amélie Reymond
359
9
0
14 Sep 2022
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
ACM Computing Surveys (ACM CSUR), 2022
Paul Pu Liang
Amir Zadeh
Louis-Philippe Morency
347
200
0
07 Sep 2022
Equivariant and Invariant Grounding for Video Question Answering
ACM Multimedia (ACM MM), 2022
Yicong Li
Xiang Wang
Junbin Xiao
Tat-Seng Chua
228
36
0
26 Jul 2022
Invariant Grounding for Video Question Answering
Computer Vision and Pattern Recognition (CVPR), 2022
Yicong Li
Xiang Wang
Junbin Xiao
Wei Ji
Tat-Seng Chua
OOD
245
116
0
06 Jun 2022
Learning to Retrieve Videos by Asking Questions
ACM Multimedia (ACM MM), 2022
Avinash Madasu
Junier Oliva
Gedas Bertasius
VGen
347
19
0
11 May 2022
Learning to Answer Visual Questions from Web Videos
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
408
40
0
10 May 2022
3MASSIV: Multilingual, Multimodal and Multi-Aspect dataset of Social Media Short Videos
Computer Vision and Pattern Recognition (CVPR), 2022
Vikram Gupta
Trisha Mittal
Puneet Mathur
Vaibhav Mishra
Mayank Maheshwari
Aniket Bera
Debdoot Mukherjee
Tianyi Zhou
VGen
297
14
0
28 Mar 2022
Video Question Answering: Datasets, Algorithms and Challenges
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yaoyao Zhong
Junbin Xiao
Wei Ji
Yicong Li
Wei Deng
Tat-Seng Chua
358
118
0
02 Mar 2022
NEWSKVQA: Knowledge-Aware News Video Question Answering
Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2022
Pranay Gupta
Manish Gupta
305
9
0
08 Feb 2022
Video as Conditional Graph Hierarchy for Multi-Granular Question Answering
Junbin Xiao
Angela Yao
Zhiyuan Liu
Yicong Li
Wei Ji
Tat-Seng Chua
391
140
0
12 Dec 2021
Question Answering Survey: Directions, Challenges, Datasets, Evaluation Matrices
Hariom A. Pandya
Brijesh S. Bhatt
213
34
0
07 Dec 2021
Simple Dialogue System with AUDITED
British Machine Vision Conference (BMVC), 2021
Eugenio Clerico
Piotr Koniusz
218
2
0
22 Oct 2021
Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos
IEEE International Conference on Computer Vision (ICCV), 2021
Heeseung Yun
Youngjae Yu
Wonsuk Yang
Kangil Lee
Gunhee Kim
326
121
0
11 Oct 2021
TrUMAn: Trope Understanding in Movies and Animations
International Conference on Information and Knowledge Management (CIKM), 2021
Hung-Ting Su
Po-Wei Shen
Bing-Chen Tsai
Wen-Feng Cheng
Ke-Jyun Wang
Winston H. Hsu
193
6
0
10 Aug 2021
iReason: Multimodal Commonsense Reasoning using Videos and Natural Language with Interpretability
Andrew Wang
Vasu Sharma
CML
244
5
0
25 Jun 2021
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions
Computer Vision and Pattern Recognition (CVPR), 2021
Junbin Xiao
Xindi Shang
Angela Yao
Tat-Seng Chua
490
776
0
18 May 2021
Relation-aware Hierarchical Attention Framework for Video Question Answering
International Conference on Multimedia Retrieval (ICMR), 2021
Fangtao Li
Ting Bai
Chenyu Cao
Zihe Liu
C. Yan
Bin Wu
266
14
0
13 May 2021
Video Question Answering with Phrases via Semantic Roles
North American Chapter of the Association for Computational Linguistics (NAACL), 2021
Arka Sadhu
Kan Chen
Ram Nevatia
203
16
0
08 Apr 2021
Visual Semantic Role Labeling for Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2021
Arka Sadhu
Tanmay Gupta
Mark Yatskar
Ram Nevatia
Aniruddha Kembhavi
VLM
426
91
0
02 Apr 2021
AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
Computer Vision and Pattern Recognition (CVPR), 2021
Madeleine Grunde-McLaughlin
Ranjay Krishna
Maneesh Agrawala
CoGe
318
151
0
30 Mar 2021
SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events
Computer Vision and Pattern Recognition (CVPR), 2021
Kepeng Xu
He Huang
Jun Liu
ViT
LRM
538
116
0
29 Mar 2021
On Semantic Similarity in Video Retrieval
Computer Vision and Pattern Recognition (CVPR), 2021
Michael Wray
Hazel Doughty
Dima Damen
297
78
0
18 Mar 2021
Narration Generation for Cartoon Videos
Nikos Papasarantopoulos
Shay B. Cohen
VGen
226
2
0
17 Jan 2021