Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2406.16852
Cited By
Long Context Transfer from Language to Vision
24 June 2024
Peiyuan Zhang
Kaichen Zhang
Bo Li
Guangtao Zeng
Jingkang Yang
Yuanhan Zhang
Ziyue Wang
Haoran Tan
Chunyuan Li
Ziwei Liu
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Long Context Transfer from Language to Vision"
33 / 33 papers shown
Title
Static or Dynamic: Towards Query-Adaptive Token Selection for Video Question Answering
Yumeng Shi
Quanyu Long
Wenya Wang
64
0
0
30 Apr 2025
VideoVista-CulturalLingo: 360
∘
^\circ
∘
Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension
Xinyu Chen
Yunxin Li
Haoyuan Shi
Baotian Hu
Wenhan Luo
Yaowei Wang
M. Zhang
ELM
59
0
0
23 Apr 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu
Weiyun Wang
Zhe Chen
Z. Liu
Shenglong Ye
...
D. Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
W. Wang
MLLM
VLM
63
6
1
14 Apr 2025
PaMi-VDPO: Mitigating Video Hallucinations by Prompt-Aware Multi-Instance Video Preference Learning
Xinpeng Ding
K. Zhang
Jinahua Han
Lanqing Hong
Hang Xu
X. Li
MLLM
VLM
71
0
0
08 Apr 2025
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng
Kaixiong Gong
B. Li
Zonghao Guo
Yibing Wang
Tianshuo Peng
J. Wu
Xiaoying Zhang
Benyou Wang
Xiangyu Yue
AI4TS
SyDa
LRM
46
13
0
27 Mar 2025
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
Xiangrui Liu
Yan Shu
Zheng Liu
Ao Li
Yang Tian
Bo Zhao
VGen
VLM
86
0
0
24 Mar 2025
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Y. Liu
Kevin Qinghong Lin
C. Chen
Mike Zheng Shou
LM&Ro
LRM
69
0
0
17 Mar 2025
VITED: Video Temporal Evidence Distillation
Yujie Lu
Yale Song
William Yang Wang
Lorenzo Torresani
Tushar Nagarajan
46
0
0
17 Mar 2025
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
Tianyuan Qu
Longxiang Tang
Bohao Peng
Senqiao Yang
Bei Yu
Jiaya Jia
VLM
62
0
0
16 Mar 2025
UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?
Yuanxin Liu
Rui Zhu
Shuhuai Ren
Jiacong Wang
Haoyuan Guo
Xu Sun
Lu Jiang
64
1
0
13 Mar 2025
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Boyu Chen
Zhengrong Yue
Siran Chen
Z. Wang
Yang Liu
Peng Li
Y. Wang
VLM
67
0
0
13 Mar 2025
VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers
Ruanjun Li
Yuedong Tan
Yuanming Shi
Jiawei Shao
VLM
70
0
0
12 Mar 2025
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Md. Mohaiminul Islam
Tushar Nagarajan
Huiyu Wang
Gedas Bertasius
Lorenzo Torresani
55
0
0
12 Mar 2025
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
Shangzhe Di
Zhelun Yu
Guanghao Zhang
Haoyuan Li
Tao Zhong
Hao Cheng
Bolin Li
Wanggui He
Fangxun Shu
Hao Jiang
53
4
0
01 Mar 2025
MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference
Zhongwei Wan
H. Shen
Xin Wang
C. Liu
Zheda Mai
M. Zhang
VLM
54
3
0
24 Feb 2025
Social Genome: Grounded Social Reasoning Abilities of Multimodal Models
Leena Mathur
Marian Qian
Paul Pu Liang
Louis-Philippe Morency
LRM
67
1
0
21 Feb 2025
LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation
Zican Dong
Junyi Li
Jinhao Jiang
Mingyu Xu
Wayne Xin Zhao
B. Wang
Weipeng Chen
VLM
91
0
0
20 Feb 2025
Magma: A Foundation Model for Multimodal AI Agents
Jianwei Yang
Reuben Tan
Qianhui Wu
Ruijie Zheng
Baolin Peng
...
Seonghyeon Ye
Joel Jang
Yuquan Deng
Lars Liden
Jianfeng Gao
VLM
AI4TS
98
8
0
18 Feb 2025
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
Yi Wang
Xinhao Li
Ziang Yan
Yinan He
Jiashuo Yu
...
Kai Chen
Wenhai Wang
Yu Qiao
Yali Wang
Limin Wang
64
19
0
21 Jan 2025
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
Ruilin Luo
Zhuofan Zheng
Yifan Wang
Yiyao Yu
Xinzhe Ni
Zicheng Lin
Jin Zeng
Yujiu Yang
LRM
53
12
0
08 Jan 2025
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan
Hang Zhang
Wentong Li
Zesen Cheng
Boqiang Zhang
...
Deli Zhao
Wenqiao Zhang
Yueting Zhuang
Jianke Zhu
Lidong Bing
61
5
0
31 Dec 2024
ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
Xiao Wang
Qingyi Si
Jianlong Wu
Shiyu Zhu
Li Cao
Liqiang Nie
VLM
67
6
0
29 Dec 2024
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
Lu Qiu
Yuying Ge
Yi Chen
Yixiao Ge
Ying Shan
Xihui Liu
LLMAG
LRM
85
5
0
05 Dec 2024
VideoOrion: Tokenizing Object Dynamics in Videos
Yicheng Feng
Yijiang Li
Wanpeng Zhang
Sipeng Zheng
Zongqing Lu
Sipeng Zheng
Zongqing Lu
89
1
0
25 Nov 2024
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
Xiangyu Zeng
Kunchang Li
Chenting Wang
Xinhao Li
Tianxiang Jiang
...
Zhengrong Yue
Yi Wang
Yali Wang
Yu Qiao
Limin Wang
MLLM
VLM
AI4TS
55
14
0
25 Oct 2024
MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA
Hanrong Ye
Haotian Zhang
Erik Daxberger
Lin Chen
Zongyu Lin
...
Haoxuan You
Dan Xu
Zhe Gan
Jiasen Lu
Yinfei Yang
EgoV
MLLM
58
12
0
09 Oct 2024
Temporal Reasoning Transfer from Text to Video
Lei Li
Yuanxin Liu
Linli Yao
Peiyuan Zhang
Chenxin An
Lean Wang
Xu Sun
Lingpeng Kong
Qi Liu
LRM
30
6
0
08 Oct 2024
Frame-Voyager: Learning to Query Frames for Video Large Language Models
Sicheng Yu
Chengkai Jin
Huanyu Wang
Zhenghao Chen
Sheng Jin
...
Zhenbang Sun
Bingni Zhang
Jiawei Wu
Hao Zhang
Qianru Sun
52
5
0
04 Oct 2024
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
Wenhao Chai
Enxin Song
Y. Du
Chenlin Meng
Vashisht Madhavan
Omer Bar-Tal
Jeng-Neng Hwang
Saining Xie
Christopher D. Manning
3DV
61
25
0
04 Oct 2024
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
Zuyan Liu
Yuhao Dong
Ziwei Liu
Winston Hu
Jiwen Lu
Yongming Rao
ObjD
61
54
0
19 Sep 2024
Question-Answering Dense Video Events
Hangyu Qin
Junbin Xiao
Angela Yao
VLM
65
1
0
06 Sep 2024
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
135
895
0
21 Dec 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
244
4,186
0
30 Jan 2023
1