2409.12961
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
19 September 2024
Zuyan Liu
Yuhao Dong
Ziwei Liu
Winston Hu
Jiwen Lu
Yongming Rao
ObjD
Papers citing
"Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution"
37 / 37 papers shown
Title
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
Haibo Wang
Bo Feng
Zhengfeng Lai
Mingze Xu
Shiyu Li
Weifeng Ge
Afshin Dehghan
Meng Cao
Ping-Chia Huang
OffRL
36
3
0
08 May 2025
ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding
Yi-Xing Peng
Q. Yang
Yu-Ming Tang
Shenghao Fu
Kun-Yu Lin
Xihan Wei
Wei-Shi Zheng
38
0
0
25 Apr 2025
VEU-Bench: Towards Comprehensive Understanding of Video Editing
Bozheng Li
Y. Wu
Yi Lu
Jiashuo Yu
Licheng Tang
Jiawang Cao
Wenqing Zhu
Yuyang Sun
Jay Wu
Wenbo Zhu
34
0
0
24 Apr 2025
VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension
Xinyu Chen
Yunxin Li
Haoyuan Shi
Baotian Hu
Wenhan Luo
Yaowei Wang
M. Zhang
ELM
57
0
0
23 Apr 2025
MR. Video: "MapReduce" is the Principle for Long Video Understanding
Ziqi Pang
Yu-xiong Wang
VLM
30
0
0
22 Apr 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu
Weiyun Wang
Zhe Chen
Z. Liu
Shenglong Ye
...
D. Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
W. Wang
MLLM
VLM
54
6
1
14 Apr 2025
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
Yang Shi
Jiaheng Liu
Yushuo Guan
Z. Wu
Y. Zhang
...
Bohan Zeng
W. Zhang
Fuzheng Zhang
Wenjing Yang
Di Zhang
VGen
VLM
63
0
0
14 Apr 2025
LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding
Ziyi Wang
Haoran Wu
Yiming Rong
Deyang Jiang
Yixin Zhang
Y. Zhao
Shuang Xu
Bo Xu
VLM
38
0
0
09 Apr 2025
SmolVLM: Redefining small and efficient multimodal models
Andres Marafioti
Orr Zohar
Miquel Farré
Merve Noyan
Elie Bakouch
...
Hugo Larcher
Mathieu Morlon
Lewis Tunstall
Leandro von Werra
Thomas Wolf
VLM
26
4
0
07 Apr 2025
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Chuanqi Cheng
Jian-Yu Guan
Wei Yu Wu
Rui Yan
VLM
40
0
0
03 Apr 2025
Slow-Fast Architecture for Video Multi-Modal Large Language Models
Min Shi
Shihao Wang
Chieh-Yun Chen
Jitesh Jain
Kai Wang
Junjun Xiong
Guilin Liu
Zhiding Yu
Humphrey Shi
31
1
0
02 Apr 2025
TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding
Junwen Pan
Rui Zhang
Xin Wan
Yuan Zhang
Ming Lu
Qi She
VLM
34
1
0
02 Apr 2025
Beyond Intermediate States: Explaining Visual Redundancy through Language
Dingchen Yang
Bowen Cao
Anran Zhang
Weibo Gu
Winston Hu
Guang Chen
VLM
66
0
0
26 Mar 2025
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
Yucheng Suo
Fan Ma
Linchao Zhu
T. Wang
Fengyun Rao
Yi Yang
LRM
68
0
0
26 Mar 2025
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
Mingze Xu
Mingfei Gao
Shiyu Li
Jiasen Lu
Zhe Gan
Zhengfeng Lai
Meng Cao
Kai Kang
Y. Yang
Afshin Dehghan
43
1
0
24 Mar 2025
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
Xiangrui Liu
Yan Shu
Zheng Liu
Ao Li
Yang Tian
Bo Zhao
VGen
VLM
86
0
0
24 Mar 2025
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
Wenxuan Zhu
Bing Li
Cheng Zheng
Jinjie Mai
Jun-Cheng Chen
...
Abdullah Hamdi
Sara Rojas Martinez
Chia-Wen Lin
Mohamed Elhoseiny
Bernard Ghanem
VLM
43
0
0
22 Mar 2025
Impossible Videos
Zechen Bai
Hai Ci
Mike Zheng Shou
EGVM
VGen
65
0
0
18 Mar 2025
Quantum EigenGame for excited state calculation
David Quiroga
Jason Han
Anastasios Kyrillidis
38
1
0
17 Mar 2025
ViSpeak: Visual Instruction Feedback in Streaming Videos
Shenghao Fu
Q. Yang
Yuan-Ming Li
Yi-Xing Peng
Kun-Yu Lin
Xihan Wei
Jian-Fang Hu
Xiaohua Xie
Wei-Shi Zheng
VLM
55
1
0
17 Mar 2025
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
Zixu Cheng
Jian Hu
Ziquan Liu
Chenyang Si
Wei Li
Shaogang Gong
LRM
56
2
0
14 Mar 2025
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
Weiming Ren
Wentao Ma
Huan Yang
Cong Wei
Ge Zhang
Wenhu Chen
Mamba
52
3
0
14 Mar 2025
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Boyu Chen
Zhengrong Yue
Siran Chen
Z. Wang
Yang Liu
Peng Li
Y. Wang
VLM
55
0
0
13 Mar 2025
EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models
Haiyang Yu
Jinghui Lu
Yanjie Wang
Yang Li
H. Wang
Can Huang
B. Li
VLM
49
1
0
06 Mar 2025
EgoLife: Towards Egocentric Life Assistant
Jingkang Yang
Shuai Liu
Hongming Guo
Yuhao Dong
X. Zhang
...
Joerg Widmer
Francesco Gringoli
Lei Yang
Bo Li
Z. Liu
EgoV
43
2
0
05 Mar 2025
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
Weitai Kang
Haifeng Huang
Yuzhang Shang
Mubarak Shah
Yan Yan
38
7
0
21 Feb 2025
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding
Zhenyu Yang
Y. Hu
Zemin Du
Dizhan Xue
Shengsheng Qian
Jiahong Wu
Fan Yang
W. Dong
Changsheng Xu
36
2
0
15 Feb 2025
HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding
Jiaxing Zhao
Q. Yang
Yixing Peng
Detao Bai
Shimin Yao
...
Xiang Chen
Shenghao Fu
Weixuan Chen
Xihan Wei
Liefeng Bo
VGen
AuLLM
48
5
0
28 Jan 2025
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Wenyi Hong
Yean Cheng
Z. Yang
Weihan Wang
Lefan Wang
Xiaotao Gu
Shiyu Huang
Yuxiao Dong
J. Tang
CoGe
VLM
60
4
0
06 Jan 2025
ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
Xiao Wang
Qingyi Si
Jianlong Wu
Shiyu Zhu
Li Cao
Liqiang Nie
VLM
51
6
0
29 Dec 2024
LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer
Yipeng Zhang
Y. Liu
Zonghao Guo
Yidan Zhang
Xuesong Yang
...
Yuan Yao
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
MLLM
VLM
73
0
0
18 Dec 2024
Do Language Models Understand Time?
Xi Ding
Lei Wang
143
0
0
18 Dec 2024
VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding
Kangsan Kim
G. Park
Youngwan Lee
Woongyeong Yeo
Sung Ju Hwang
81
3
0
03 Dec 2024
SEAL: Semantic Attention Learning for Long Video Representation
Lan Wang
Yujia Chen
Wen-Sheng Chu
Vishnu Naresh Boddeti
Du Tran
VLM
65
0
0
02 Dec 2024
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
Yongdong Luo
Xiawu Zheng
Xiao Yang
Guilin Li
Haojia Lin
Jinfa Huang
Jiayi Ji
Fei Chao
Jiebo Luo
Rongrong Ji
VLM
68
12
0
20 Nov 2024
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
Yunlong Tang
Junjia Guo
Hang Hua
Susan Liang
Mingqian Feng
...
Chao Huang
Jing Bi
Zeliang Zhang
Pooyan Fazli
Chenliang Xu
CoGe
58
8
0
17 Nov 2024
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Ziyang Wang
Shoubin Yu
Elias Stengel-Eskin
Jaehong Yoon
Feng Cheng
Gedas Bertasius
Mohit Bansal
31
56
0
29 May 2024