ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2407.15841
  4. Cited By
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language
  Models

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

22 July 2024
Mingze Xu
Mingfei Gao
Zhe Gan
Hong-You Chen
Zhengfeng Lai
Haiming Gang
Kai Kang
Afshin Dehghan
ArXiv (abs)PDFHTMLHuggingFace (41 upvotes)

Papers citing "SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models"

36 / 36 papers shown
Title
MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
Xiyang Wu
Zongxia Li
Jihui Jin
Guangyao Shi
Gouthaman KV
Vishnu Raj
Nilotpal Sinha
Jingxi Chen
Fan Du
Dinesh Manocha
84
0
0
23 Nov 2025
SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System
SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System
Zhiyu Xu
Weilong Yan
Yufei Shi
Xin Meng
Tao He
Huiping Zhuang
Ming Li
Hehe Fan
LLMAGLRM
142
0
0
22 Nov 2025
VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree
VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree
Wenlong Li
Yifei Xu
Yuan Rao
Zhenhua Wang
Shuiguang Deng
128
0
0
26 Oct 2025
D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition
D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition
Y. Huang
Yizhou Wang
Yun Fu
VLM
66
0
0
09 Oct 2025
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Yunlong Tang
Jing Bi
Pinxin Liu
Zhenyu Pan
Mingqian Feng
...
Zeliang Zhang
Daiki Shimada
Han Liu
Jiebo Luo
Chenliang Xu
MLLMOffRLVLMLRM
538
8
0
06 Oct 2025
UniVid: The Open-Source Unified Video Model
UniVid: The Open-Source Unified Video Model
Jiabin Luo
Junhui Lin
Zeyu Zhang
Biao Wu
Meng Fang
Ling-Hao Chen
Hao Tang
VGen
214
6
0
29 Sep 2025
SAIL-VL2 Technical Report
SAIL-VL2 Technical Report
Weijie Yin
Yongjie Ye
Fangxun Shu
Yue Liao
Zijian Kang
...
Han Wang
Wenzhuo Liu
Xiao Liang
Shuicheng Yan
Chao Feng
LRMVLM
244
2
0
17 Sep 2025
AToken: A Unified Tokenizer for Vision
AToken: A Unified Tokenizer for Vision
Jiasen Lu
Liangchen Song
Mingze Xu
Byeongjoo Ahn
Yanjun Wang
Chen Chen
Afshin Dehghan
Yinfei Yang
ViT
188
6
0
17 Sep 2025
Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
Hyungjin Chung
Hyelin Nam
J. Kim
Hyojun Go
Byeongjun Park
Junho Kim
J. Lee
Seongsu Ha
Byung-Hoon Kim
113
0
0
09 Sep 2025
Harnessing Object Grounding for Time-Sensitive Video Understanding
Harnessing Object Grounding for Time-Sensitive Video Understanding
Tz-Ying Wu
S. N. Sridhar
Subarna Tripathi
85
0
0
08 Sep 2025
Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data
Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data
Honglu Zhou
Xiangyu Peng
Shrikant B. Kendre
Michael S Ryoo
Silvio Savarese
Caiming Xiong
Juan Carlos Niebles
92
0
0
03 Sep 2025
FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models
FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models
Yiming Yang
Hongbin Lin
Yueru Luo
Suzhong Fu
C. Zheng
Xinrui Yan
Shuqi Mei
Kun Tang
Shuguang Cui
Zhen Li
LRM
254
1
0
31 Jul 2025
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
Kele Shao
Keda Tao
Kejia Zhang
Sicheng Feng
Mu Cai
Yuzhang Shang
Haoxuan You
Can Qin
Yang Sui
Huan Wang
429
9
0
27 Jul 2025
CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model
CountLLM: Towards Generalizable Repetitive Action Counting via Large Language ModelComputer Vision and Pattern Recognition (CVPR), 2025
Ziyu Yao
Xuxin Cheng
Zhiqi Huang
Lei Li
332
5
0
01 Jul 2025
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
Haibo Wang
Bo Feng
Zhengfeng Lai
Mingze Xu
Shiyu Li
Weifeng Ge
Afshin Dehghan
Meng Cao
Ping Huang
OffRL
500
4
0
08 May 2025
VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment
VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment
Yogesh Kulkarni
Pooyan Fazli
392
4
0
18 Apr 2025
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
Pritam Sarkar
Ali Etemad
278
2
0
16 Apr 2025
KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation
KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation
Xingrui Wang
Jiang-Long Liu
Liang Luo
Xiaodong Yu
Jialian Wu
Xingwu Sun
Yusheng Su
Yaoyao Liu
Zicheng Liu
Emad Barsoum
DiffMVGen
209
4
0
13 Apr 2025
REVEAL: Relation-based Video Representation Learning for Video-Question-Answering
REVEAL: Relation-based Video Representation Learning for Video-Question-Answering
Sofian Chaybouti
Walid Bousselham
Moritz Wolter
Hilde Kuehne
796
0
0
07 Apr 2025
Safety Modulation: Enhancing Safety in Reinforcement Learning through Cost-Modulated Rewards
Safety Modulation: Enhancing Safety in Reinforcement Learning through Cost-Modulated Rewards
Hanping Zhang
Yuhong Guo
OffRL
240
1
0
03 Apr 2025
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video UnderstandingComputer Vision and Pattern Recognition (CVPR), 2025
Shuming Liu
Chen Zhao
Tianqi Xu
Bernard Ghanem
VLM
256
20
0
27 Mar 2025
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
Yucheng Suo
Fan Ma
Linchao Zhu
T. Wang
Fengyun Rao
Yi Yang
LRM
290
5
0
26 Mar 2025
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
Erik Daxberger
Nina Wenzel
David Griffiths
Haiming Gang
Justin Lazarow
...
Kai Kang
Marcin Eichner
Yue Yang
Afshin Dehghan
Peter Grasch
258
25
0
17 Mar 2025
LION-FS: Fast & Slow Video-Language Thinker as Online Video AssistantComputer Vision and Pattern Recognition (CVPR), 2025
Wei Li
Bing Hu
Rui Shao
Leyang Shen
Liqiang Nie
223
30
0
05 Mar 2025
ENTER: Event Based Interpretable Reasoning for VideoQA
ENTER: Event Based Interpretable Reasoning for VideoQA
Hammad A. Ayyubi
Junzhang Liu
Ali Asgarov
Zaber Ibn Abdul Hakim
Najibul Haque Sarker
...
Md. Atabuzzaman
Xudong Lin
Naveen Reddy Dyava
Shih-Fu Chang
Chris Thomas
NAI
470
3
0
24 Jan 2025
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token MarksComputer Vision and Pattern Recognition (CVPR), 2025
Miran Heo
Min-Hung Chen
De-An Huang
Sifei Liu
Subhashree Radhakrishnan
Seon Joo Kim
Yu-Chun Wang
Ryo Hachiuma
ObjDVLM
484
7
0
14 Jan 2025
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for
  Accelerating Large VLMs
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMsComputer Vision and Pattern Recognition (CVPR), 2024
Wangbo Zhao
Yizeng Han
Jiasheng Tang
Hao Sun
Yibing Song
Kaidi Wang
Zinan Lin
Yang You
405
22
0
04 Dec 2024
VideoSAVi: Self-Aligned Video Language Models without Human Supervision
VideoSAVi: Self-Aligned Video Language Models without Human Supervision
Yogesh Kulkarni
Pooyan Fazli
VLM
526
5
0
01 Dec 2024
Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
Yiming Zhang
Zhuokai Zhao
Zhaorun Chen
Zenghui Ding
Xianjun Yang
Yining Sun
1.0K
9
0
21 Nov 2024
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Michael S Ryoo
Honglu Zhou
Shrikant B. Kendre
Can Qin
Le Xue
...
Kanchana Ranasinghe
Caiming Xiong
Ran Xu
Caiming Xiong
Juan Carlos Niebles
VGen
250
25
0
21 Oct 2024
MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA
MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA
Hanrong Ye
Haotian Zhang
Erik Daxberger
Lin Chen
Zongyu Lin
...
Haoxuan You
Dan Xu
Zhe Gan
Jiasen Lu
Yinfei Yang
EgoVMLLM
243
18
0
09 Oct 2024
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
AuroraCap: Efficient, Performant Video Detailed Captioning and a New BenchmarkInternational Conference on Learning Representations (ICLR), 2024
Wenhao Chai
Enxin Song
Y. Du
Chenlin Meng
Vashisht Madhavan
Omer Bar-Tal
Jeng-Neng Hwang
Saining Xie
Christopher D. Manning
3DV
533
87
0
04 Oct 2024
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
Haibo Wang
Zhiyang Xu
Yu Cheng
Shizhe Diao
Jiuxiang Gu
Yixin Cao
Qifan Wang
Weifeng Ge
Lifu Huang
210
52
0
04 Oct 2024
LLaVA-Video: Video Instruction Tuning With Synthetic Data
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang
Jinming Wu
W. Li
Bo Li
Zejun Ma
Ziwei Liu
Chunyuan Li
SyDaVGen
416
248
0
03 Oct 2024
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Ziyang Wang
Shoubin Yu
Elias Stengel-Eskin
Jaehong Yoon
Feng Cheng
Gedas Bertasius
Mohit Bansal
397
140
0
29 May 2024
Video Understanding with Large Language Models: A Survey
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Chenliang Xu
Jiebo Luo
Chenliang Xu
VLM
619
155
0
29 Dec 2023
1