LongVLM: Efficient Long Video Understanding via Large Language Models

European Conference on Computer Vision (ECCV), 2024
4 April 2024
Yuetian Weng
Mingfei Han
Haoyu He
Xiaojun Chang
Bohan Zhuang
    VLM

Papers citing "LongVLM: Efficient Long Video Understanding via Large Language Models"

50 / 63 papers shown
SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models
Ruosen Zhao
Zhikang Zhang
Jialei Xu
Jiahao Chang
Dong Chen
Lingyun Li
Weijian Sun
Zizhuang Wei
VLM, LRM
285
4
0
28 Nov 2025
MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
Xiyang Wu
Zongxia Li
Jihui Jin
Guangyao Shi
Gouthaman KV
Vishnu Raj
Nilotpal Sinha
Jingxi Chen
Fan Du
Dinesh Manocha
LRM
196
0
0
23 Nov 2025
Striking the Right Balance between Compute and Copy: Improving LLM Inferencing Under Speculative Decoding
Arun Ramachandran
Ramaswamy Govindarajan
M. Annavaram
Prakash Raghavendra
Hossein Entezari Zarch
Lei Gao
Chaoyi Jiang
201
1
0
15 Nov 2025
FOCUS: Efficient Keyframe Selection for Long Video Understanding
Zirui Zhu
Hailun Xu
Yang Luo
Yong Liu
Kanchan Sarkar
Zhenheng Yang
Yang You
239
9
0
31 Oct 2025
FeatureFool: Zero-Query Fooling of Video Models via Feature Map
Duoxun Tang
Xi Xiao
Guangwu Hu
Kangkang Sun
Xiao Yang
Dongyang Chen
Qing Li
Yongjie Yin
Jiyao Wang
AAML
309
1
0
21 Oct 2025
VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
Jiaying Zhu
Yurui Zhu
Xin Lu
Wenrui Yan
Dong Li
Kunlin Liu
Xueyang Fu
Zheng-Jun Zha
MQ, VLM
295
4
0
18 Oct 2025
Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs
Minji Kim
Taekyung Kim
Bohyung Han
149
2
0
15 Oct 2025
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Yunlong Tang
Jing Bi
Pinxin Liu
Zhenyu Pan
Mingqian Feng
...
Zeliang Zhang
Daiki Shimada
Han Liu
Jiebo Luo
Chenliang Xu
MLLM, OffRL, VLM, LRM
895
8
0
06 Oct 2025
AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding
Xian Zhang
Zexi Wu
Zinuo Li
Hongming Xu
Luqi Gong
F. Boussaïd
Naoufel Werghi
Mohammed Bennamoun
VGen
139
2
0
03 Oct 2025
POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency
Ashim Dahal
Ankit Ghimire
Saydul Akbar Murad
Nick Rahimi
LRM
242
0
0
01 Oct 2025
Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
Alessio Devoto
Maximilian Jeblick
Simon Jégou
MQ, VLM
164
14
0
01 Oct 2025
From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models
Chenyue Zhou
Mingxuan Wang
Yanbiao Ma
Chenxu Wu
Wanyi Chen
...
Guoli Jia
Lingling Li
Z. Lu
Y. Lu
Wenhan Luo
LRM
634
14
0
29 Sep 2025
EgoInstruct: An Egocentric Video Dataset of Face-to-face Instructional Interactions with Multi-modal LLM Benchmarking
Yuki Sakai
Ryosuke Furuta
Juichun Yen
Yoichi Sato
145
0
0
26 Sep 2025
Poisoning Prompt-Guided Sampling in Video Large Language Models
Yuxin Cao
Wei Song
Jingling Xue
Jin Song Dong
AAML
153
1
0
25 Sep 2025
Track-On2: Enhancing Online Point Tracking with Memory
Görkay Aydemir
Weidi Xie
Fatma Guney
VOT, 3DV
302
3
0
23 Sep 2025
Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration
Zhitao Zeng
Guojian Yuan
Junyuan Mao
Yuxuan Wang
Xiaoshuang Jia
Yueming Jin
330
0
0
22 Sep 2025
Eye Gaze Tells You Where to Compute: Gaze-Driven Efficient VLMs
Qinyu Chen
Jiawen Qi
145
0
0
20 Sep 2025
AToken: A Unified Tokenizer for Vision
Jiasen Lu
Liangchen Song
Mingze Xu
Byeongjoo Ahn
Yanjun Wang
Chen Chen
Afshin Dehghan
Yinfei Yang
ViT
342
14
0
17 Sep 2025
When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding
Pengcheng Fang
Yuxia Chen
Rui Guo
VGen
134
3
0
21 Aug 2025
Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models
Thanh-Dat Truong
Huu-Thien Tran
Tran Thai Son
Bhiksha Raj
Khoa Luu
373
2
0
19 Aug 2025
Failures to Surface Harmful Contents in Video Large Language Models
Yuxin Cao
Wei Song
Derui Wang
Jingling Xue
Jin Song Dong
AAML
229
2
0
14 Aug 2025
VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding
Jianxiang He
Shaoguang Wang
Weiyu Guo
Ziyang Chen
Ziyang Chen
Yijie Xu
314
0
0
09 Aug 2025
VesselRW: Weakly Supervised Subcutaneous Vessel Segmentation via Learned Random Walk Propagation
Ayaan Nooruddin Siddiqui
Mahnoor Zaidi
Ayesha Nazneen Shahbaz
Priyadarshini Chatterjee
Krishnan Menon Iyer
314
0
0
09 Aug 2025
Edge Detection for Organ Boundaries via Top Down Refinement and SubPixel Upsampling
Aarav Mehta
Priya Deshmukh
Vikram Singh
Siddharth Malhotra
Krishnan Menon Iyer
Tanvi Iyer
MedIm
347
0
0
09 Aug 2025
DualResolution Residual Architecture with Artifact Suppression for Melanocytic Lesion Segmentation
Vikram Singh
Kabir Malhotra
Rohan Desai
Ananya Shankaracharya
Priyadarshini Chatterjee
Krishnan Menon Iyer
MedIm
399
0
0
09 Aug 2025
Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning
H. Zhang
Xin Gu
Jiawen Li
Chixiang Ma
Sule Bai
Chubin Zhang
Bowen Zhang
Zhichao Zhou
Dongliang He
Yansong Tang
OffRL, LRM
265
45
0
06 Aug 2025
Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding
Minghang Zheng
Yuxin Peng
Benyuan Sun
Yi Yang
Yang Liu
201
0
0
06 Aug 2025
Deeply Dual Supervised learning for melanoma recognition
Rujosh Polma
Krishnan Menon Iyer
278
0
0
04 Aug 2025
Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation
Sobhan Asasi
Mohamed Ilyas Lakhal
Ozge Mercanoglu Sincan
Richard Bowden
SLR
300
1
0
31 Jul 2025
FMimic: Foundation Models are Fine-grained Action Learners from Human Videos
The International Journal of Robotics Research (IJRR), 2025
Guangyan Chen
Meiling Wang
Te Cui
Yao Mu
Haoyang Lu
...
Mengxiao Hu
Tianxing Zhou
M. Fu
Yi Yang
Yufeng Yue
LM&Ro, VLM
287
6
0
28 Jul 2025
A Survey of Token Compression for Efficient Multimodal Large Language Models
Kele Shao
Keda Tao
Kejia Zhang
Sicheng Feng
Mu Cai
Yuzhang Shang
Haoxuan You
Can Qin
Yang Sui
Huan Wang
713
12
0
27 Jul 2025
Scaling RL to Long Videos
Yukang Chen
Wei Huang
Baifeng Shi
Qinghao Hu
Hanrong Ye
...
Xiaojuan Qi
Sifei Liu
Hongxu Yin
Yao Lu
Song Han
OffRL, AI4TS, VLM, LRM
540
64
0
10 Jul 2025
Video, How Do Your Tokens Merge?
Sam Pollard
Michael Wray
ViT, MoMe
323
1
0
04 Jun 2025
Time Blindness: Why Video-Language Models Can't See What Humans Can?
Ujjwal Upadhyay
Mukul Ranjan
Zhiqiang Shen
Mohamed Elhoseiny
VLM
249
9
0
30 May 2025
"I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments
Zheng Zhang
Zhen Sun
Zhenru Zhang
Zifan Peng
Yuemeng Zhao
Liang Luo
Zeren Luo
Ruiting Zuo
Xinlei He
373
5
0
07 May 2025
MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
Yucheng Li
Huiqiang Jiang
Chengruidong Zhang
Qianhui Wu
Xufang Luo
...
Amir H. Abdi
Dongsheng Li
Jianfeng Gao
Yue Yang
Lili Qiu
437
23
0
22 Apr 2025
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
ACM Multimedia (ACM MM), 2025
Yang Shi
Jiaheng Liu
Yushuo Guan
Zhikai Wu
Yujiao Shi
...
Bohan Zeng
Wei Zhang
Fuzheng Zhang
Wenjing Yang
Di Zhang
VGen, VLM
469
17
0
14 Apr 2025
Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding
Dibyadip Chatterjee
Edoardo Remelli
Yale Song
Bugra Tekin
Abhay Mittal
...
Shreyas Hampali
Eric Sauser
Shugao Ma
Angela Yao
Fadime Sener
VLM
301
6
0
10 Apr 2025
LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding
Ziyi Wang
Haoran Wu
Yiming Rong
Deyang Jiang
Yixin Zhang
Yue Zhao
Shuang Xu
Bo Xu
VLM
330
5
0
09 Apr 2025
Safety Modulation: Enhancing Safety in Reinforcement Learning through Cost-Modulated Rewards
Hanping Zhang
Yuhong Guo
OffRL
330
2
0
03 Apr 2025
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
Yucheng Suo
Fan Ma
Linchao Zhu
T. Wang
Fengyun Rao
Yi Yang
LRM
341
8
0
26 Mar 2025
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
Xiangrui Liu
Yan Shu
Zhengyang Liang
Ao Li
Yang Tian
Bo Zhao
VGen, VLM
631
39
0
24 Mar 2025
PVChat: Personalized Video Chat with One-Shot Learning
Yufei Shi
Weilong Yan
Gang Xu
Yumeng Li
Yongqian Li
Hao Sun
Fei Richard Yu
Ming Li
Si Yong Yeo
430
6
0
21 Mar 2025
Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding
Weiyu Guo
Ziyang Chen
Shaoguang Wang
Jianxiang He
Yijie Xu
Jinhui Ye
Ying Sun
Hui Xiong
419
23
0
17 Mar 2025
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Boyu Chen
Zhengrong Yue
Siran Chen
Xiping Hu
Yang Liu
Ziwei Sun
Longji Xu
VLM
1.3K
33
0
13 Mar 2025
Memory-enhanced Retrieval Augmentation for Long Video Understanding
Huaying Yuan
Zhengyang Liang
Minhao Qin
Hongjin Qian
Yan Shu
Zhicheng Dou
Ji-Rong Wen
Andrii Zadaianchuk
VOS, RALM, VLM
479
13
0
12 Mar 2025
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2025
Shehreen Azad
Vibhav Vineet
Yogesh S Rawat
VLM
1.1K
15
0
11 Mar 2025
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Sreyan Ghosh
Zhifeng Kong
Sonal Kumar
S. Sakshi
Jaehyeon Kim
Ming-Yu Liu
Rafael Valle
Dinesh Manocha
Bryan Catanzaro
MLLM, AuLLM, LRM
422
108
0
06 Mar 2025
EgoLife: Towards Egocentric Life Assistant
Computer Vision and Pattern Recognition (CVPR), 2025
Jingkang Yang
Shuai Liu
Hongming Guo
Yuhao Dong
Xinyu Zhang
...
Joerg Widmer
Francesco Gringoli
Lei Yang
Bo Li
Ziwei Liu
EgoV
337
12
0
05 Mar 2025
Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding
International Conference on Learning Representations (ICLR), 2025
Xin Gu
Yaojie Shen
Chenxi Luo
Tiejian Luo
Yan Huang
Lu Ma
Heng Fan
L. Zhang
356
10
0
16 Feb 2025