LongVLM: Efficient Long Video Understanding via Large Language Models

European Conference on Computer Vision (ECCV), 2024

4 April 2024

Yuetian Weng

Mingfei Han

Haoyu He

Xiaojun Chang

Bohan Zhuang

VLM

Papers citing "LongVLM: Efficient Long Video Understanding via Large Language Models"

50 / 63 papers shown

SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models

285

28 Nov 2025

MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

196

23 Nov 2025

Striking the Right Balance between Compute and Copy: Improving LLM Inferencing Under Speculative Decoding

Arun Ramachandran

Ramaswamy Govindarajan

M. Annavaram

Prakash Raghavendra

Hossein Entezari Zarch

Lei Gao

Chaoyi Jiang

201

15 Nov 2025

FOCUS: Efficient Keyframe Selection for Long Video Understanding

239

31 Oct 2025

FeatureFool: Zero-Query Fooling of Video Models via Feature Map

309

21 Oct 2025

VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs

295

18 Oct 2025

Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs

Minji Kim

Taekyung Kim

Bohyung Han

149

15 Oct 2025

Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

...

895

06 Oct 2025

AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding

139

03 Oct 2025

POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency

242

01 Oct 2025

Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution

164

01 Oct 2025

From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models

...

634

29 Sep 2025

EgoInstruct: An Egocentric Video Dataset of Face-to-face Instructional Interactions with Multi-modal LLM Benchmarking

145

26 Sep 2025

Poisoning Prompt-Guided Sampling in Video Large Language Models

153

25 Sep 2025

Track-On2: Enhancing Online Point Tracking with Memory

302

23 Sep 2025

Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration

330

22 Sep 2025

Eye Gaze Tells You Where to Compute: Gaze-Driven Efficient VLMs

Qinyu Chen

Jiawen Qi

145

20 Sep 2025

AToken: A Unified Tokenizer for Vision

342

17 Sep 2025

When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding

134

21 Aug 2025

Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

373

19 Aug 2025

Failures to Surface Harmful Contents in Video Large Language Models

229

14 Aug 2025

VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding

314

09 Aug 2025

VesselRW: Weakly Supervised Subcutaneous Vessel Segmentation via Learned Random Walk Propagation

Ayaan Nooruddin Siddiqui

Mahnoor Zaidi

Ayesha Nazneen Shahbaz

Priyadarshini Chatterjee

Krishnan Menon Iyer

314

09 Aug 2025

Edge Detection for Organ Boundaries via Top Down Refinement and SubPixel Upsampling

347

09 Aug 2025

DualResolution Residual Architecture with Artifact Suppression for Melanocytic Lesion Segmentation

Vikram Singh

Kabir Malhotra

Rohan Desai

Ananya Shankaracharya

Priyadarshini Chatterjee

Krishnan Menon Iyer

MedIm

399

09 Aug 2025

Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning

265

06 Aug 2025

Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding

201

06 Aug 2025

Deeply Dual Supervised learning for melanoma recognition

Rujosh Polma

Krishnan Menon Iyer

278

04 Aug 2025

Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation

Sobhan Asasi

Mohamed Ilyas Lakhal

Ozge Mercanoglu Sincan

Richard Bowden

SLR

300

31 Jul 2025

FMimic: Foundation Models are Fine-grained Action Learners from Human Videos

The International Journal of Robotics Research (IJRR), 2025

...

287

28 Jul 2025

A Survey of Token Compression for Efficient Multimodal Large Language Models

713

27 Jul 2025

Scaling RL to Long Videos

...

540

10 Jul 2025

Video, How Do Your Tokens Merge?

Sam Pollard

Michael Wray

ViT MoMe

323

04 Jun 2025

Time Blindness: Why Video-Language Models Can't See What Humans Can?

249

30 May 2025

"I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments

373

07 May 2025

MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention

...

437

22 Apr 2025

Mavors: Multi-granularity Video Representation for Multimodal Large Language Model

ACM Multimedia (ACM MM), 2025

...

469

14 Apr 2025

Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding

...

301

10 Apr 2025

LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding

330

09 Apr 2025

Safety Modulation: Enhancing Safety in Reinforcement Learning through Cost-Modulated Rewards

Hanping Zhang

Yuhong Guo

OffRL

330

03 Apr 2025

From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment

341

26 Mar 2025

Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding

631

24 Mar 2025

PVChat: Personalized Video Chat with One-Shot Learning

430

21 Mar 2025

Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding

419

17 Mar 2025

LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

1.3K

13 Mar 2025

Memory-enhanced Retrieval Augmentation for Long Video Understanding

479

12 Mar 2025

HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding

Computer Vision and Pattern Recognition (CVPR), 2025

1.1K

11 Mar 2025

Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

422

108

06 Mar 2025

EgoLife: Towards Egocentric Life Assistant

Computer Vision and Pattern Recognition (CVPR), 2025

...

337

05 Mar 2025

Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding

International Conference on Learning Representations (ICLR), 2025

356

16 Feb 2025