Towards Long-Form Video Understanding

21 June 2021

Papers citing "Towards Long-Form Video Understanding"

31 / 31 papers shown

Title
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding Shehreen Azad Vibhav Vineet Y. S. Rawat VLM 56 1 0 11 Mar 2025
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames Pinelopi Papalampidi Skanda Koppula Shreya Pathak Justin T Chiu Joseph Heyward Viorica Patraucean Jiajun Shen Antoine Miech Andrew Zisserman Aida Nematzdeh VLM 56 23 0 31 Dec 2024
Neptune: The Long Orbit to Benchmarking Long Video Understanding Arsha Nagrani Mingda Zhang Ramin Mehran Rachel Hornung N. B. Gundavarapu ... Boqing Gong Cordelia Schmid Mikhail Sirotenko Yukun Zhu Tobias Weyand 98 4 0 12 Dec 2024
Video LLMs for Temporal Reasoning in Long Videos Fawad Javed Fateh Umer Ahmed Hamza Khan M. Zia Quoc-Huy Tran VLM 77 0 0 04 Dec 2024
Human-inspired Perspectives: A Survey on AI Long-term Memory Zihong He Weizhe Lin Hao Zheng Fan Zhang Matt Jones Laurence Aitchison X. Xu Miao Liu Per Ola Kristensson Junxiao Shen 65 2 0 01 Nov 2024
Learning Video Context as Interleaved Multimodal Sequences S. Shao Pengchuan Zhang Y. Li Xide Xia A. Meso Ziteng Gao Jinheng Xie N. Holliman Mike Zheng Shou 41 5 0 31 Jul 2024
MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD Ioanna Ntinou Enrique Sanchez Georgios Tzimiropoulos 27 0 0 11 Jun 2024
Encoding and Controlling Global Semantics for Long-form Video Question Answering Thong Nguyen Zhiyuan Hu Xiaobao Wu Cong-Duy Nguyen See-Kiong Ng A. Luu 32 2 0 30 May 2024
CinePile: A Long Video Question Answering Dataset and Benchmark Ruchit Rawal Khalid Saifullah Ronen Basri David Jacobs Gowthami Somepalli Tom Goldstein 38 39 0 14 May 2024
Koala: Key frame-conditioned long video-LLM Reuben Tan Ximeng Sun Ping Hu Jui-hsien Wang Hanieh Deilamsalehy Bryan A. Plummer Bryan C. Russell Kate Saenko 38 35 0 05 Apr 2024
VideoDistill: Language-aware Vision Distillation for Video Question Answering Bo Zou Chao Yang Yu Qiao Chengbin Quan Youjian Zhao VGen 31 1 0 01 Apr 2024
Find the Cliffhanger: Multi-Modal Trailerness in Soap Operas Carlo Bretti Pascal Mettes Hendrik Vincent Koops Daan Odijk N. V. Noord 17 4 0 29 Jan 2024
Action Scene Graphs for Long-Form Understanding of Egocentric Videos Ivan Rodin Antonino Furnari Kyle Min Subarna Tripathi G. Farinella EgoV 16 12 0 06 Dec 2023
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains Rohan Myer Krishnan Zitian Tang Zhiqiu Yu Chen Sun 33 1 0 30 Nov 2023
Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition Jiaming Zhou Hanjun Li Kun-Yu Lin Junwei Liang 13 1 0 28 Nov 2023
Masked Feature Modelling: Feature Masking for the Unsupervised Pre-training of a Graph Attention Network Block for Bottom-up Video Event Recognition Dimitrios Daskalakis Nikolaos Gkalelis Vasileios Mezaris 25 0 0 24 Aug 2023
How can objects help action recognition? Xingyi Zhou Anurag Arnab Chen Sun Cordelia Schmid 30 14 0 20 Jun 2023
Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning Nikhil Singh Chih-Wei Wu Iroro Orife Mahdi M. Kalayeh 23 2 0 12 Apr 2023
On the Benefits of 3D Pose and Tracking for Human Action Recognition Jathushan Rajasegaran Georgios Pavlakos Angjoo Kanazawa Christoph Feichtenhofer Jitendra Malik 19 30 0 03 Apr 2023
Holistic Interaction Transformer Network for Action Detection Gueter Josmy Faure Min-Hung Chen S. Lai 31 36 0 23 Oct 2022
Video Question Answering with Iterative Video-Text Co-Tokenization A. Piergiovanni K. Morton Weicheng Kuo Michael S. Ryoo A. Angelova 10 17 0 01 Aug 2022
EgoEnv: Human-centric environment representations from egocentric video Tushar Nagarajan Santhosh Kumar Ramakrishnan Ruta Desai James M. Hillis Kristen Grauman EgoV 19 19 0 22 Jul 2022
The Anatomy of Video Editing: A Dataset and Benchmark Suite for AI-Assisted Video Editing Dawit Mureja Argaw Fabian Caba Heilbron Joon-Young Lee Markus Woodson In So Kweon VGen 37 22 0 20 Jul 2022
Depth-Guided Sparse Structure-from-Motion for Movies and TV Shows Shengqing Liu Xiaohan Nie Raffay Hamid 12 16 0 05 Apr 2022
Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos Shao-Wei Liu Subarna Tripathi Somdeb Majumdar Xiaolong Wang EgoV 20 93 0 04 Apr 2022
Long Movie Clip Classification with State-Space Video Models Md. Mohaiminul Islam Gedas Bertasius VLM 25 100 0 04 Apr 2022
Video Transformers: A Survey Javier Selva A. S. Johansen Sergio Escalera Kamal Nasrollahi T. Moeslund Albert Clapés ViT 20 101 0 16 Jan 2022
Masked Feature Prediction for Self-Supervised Visual Pre-Training Chen Wei Haoqi Fan Saining Xie Chaoxia Wu Alan Yuille Christoph Feichtenhofer ViT 25 651 0 16 Dec 2021
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection Yanghao Li Chaoxia Wu Haoqi Fan K. Mangalam Bo Xiong Jitendra Malik Christoph Feichtenhofer ViT 25 671 0 02 Dec 2021
Audiovisual SlowFast Networks for Video Recognition Fanyi Xiao Yong Jae Lee Kristen Grauman Jitendra Malik Christoph Feichtenhofer 192 204 0 23 Jan 2020
ECO: Efficient Convolutional Network for Online Video Understanding Mohammadreza Zolfaghari Kamaljeet Singh Thomas Brox 119 495 0 24 Apr 2018