ResearchTrend.AI
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models


28 November 2023
Yanwei Li
Chengyao Wang
Jiaya Jia
    VLM
    MLLM

Papers citing "LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models"

50 / 202 papers shown
Vision and Intention Boost Large Language Model in Long-Term Action Anticipation
Congqi Cao
Lanshu Hu
Yating Yu
Y. Zhang
VLM
42
0
0
03 May 2025
Zoomer: Adaptive Image Focus Optimization for Black-box MLLM
Jiaxu Qian
Chendong Wang
Y. Yang
Chaoyun Zhang
Huiqiang Jiang
...
Saravan Rajmohan
Dongmei Zhang
Y. Yang
Qi Zhang
Lili Qiu
VLM
65
0
0
30 Apr 2025
FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding
Rong Gao
Xin Liu
Zhuozhao Hu
Bohao Xing
Baiqiang Xia
Zitong Yu
H. Kalviainen
38
0
0
28 Apr 2025
Learning Streaming Video Representation via Multitask Training
Yibin Yan
Jilan Xu
Shangzhe Di
Yikun Liu
Yudi Shi
Qirui Chen
Zeqian Li
Yifei Huang
Weidi Xie
CLL
76
0
0
28 Apr 2025
A Survey of Foundation Model-Powered Recommender Systems: From Feature-Based, Generative to Agentic Paradigms
Chengkai Huang
Hongtao Huang
Tong Yu
Kaige Xie
Junda Wu
Shuai Zhang
Julian McAuley
Dietmar Jannach
Lina Yao
LRM
AI4CE
17
0
0
23 Apr 2025
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs
Z. Wang
Senthil Purushwalkam
Caiming Xiong
S.
Heng Ji
R. Xu
28
0
0
23 Apr 2025
MR. Video: "MapReduce" is the Principle for Long Video Understanding
Ziqi Pang
Yu-xiong Wang
VLM
32
0
0
22 Apr 2025
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
David Ma
Y. Zhang
J. Ren
Jarvis Guo
Yifan Yao
...
Shiwen Ni
J. H. Liu
Wenhao Huang
Ge Zhang
Xiaojie Jin
VLM
32
0
0
21 Apr 2025
An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes
Ji Qi
Y. Yao
Yushi Bai
Bin Xu
Juanzi Li
Zhiyuan Liu
Tat-Seng Chua
29
0
0
21 Apr 2025
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
Enxin Song
Wenhao Chai
Weili Xu
Jianwen Xie
Yuxuan Liu
Gaoang Wang
54
0
0
20 Apr 2025
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models
Haojian Huang
Haodong Chen
Shengqiong Wu
Meng Luo
Jinlan Fu
Xinya Du
H. Zhang
Hao Fei
AI4TS
58
0
0
17 Apr 2025
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
Pritam Sarkar
Ali Etemad
22
0
0
16 Apr 2025
Leveraging multimodal explanatory annotations for video interpretation with Modality Specific Dataset
Elisa Ancarani
Julie Tores
L. Sassatelli
Rémy Sun
Hui-Yin Wu
F. Precioso
19
0
0
15 Apr 2025
Multimodal Long Video Modeling Based on Temporal Dynamic Context
Haoran Hao
Jiaming Han
Yiyuan Zhang
Xiangyu Yue
30
0
0
14 Apr 2025
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
Yang Shi
Jiaheng Liu
Yushuo Guan
Z. Wu
Y. Zhang
...
Bohan Zeng
W. Zhang
Fuzheng Zhang
Wenjing Yang
Di Zhang
VGen
VLM
63
0
0
14 Apr 2025
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
Xingjian Zhang
Siwei Wen
Wenjun Wu
Lei Huang
LRM
21
1
0
13 Apr 2025
SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding
Yangliu Hu
Zikai Song
Na Feng
Yawei Luo
Junqing Yu
Yi-Ping Phoebe Chen
Wei Yang
30
0
0
10 Apr 2025
VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding
Henghao Zhao
Ge-Peng Ji
Rui Yan
Huan Xiong
Zechao Li
16
0
0
10 Apr 2025
LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding
Ziyi Wang
Haoran Wu
Yiming Rong
Deyang Jiang
Yixin Zhang
Y. Zhao
Shuang Xu
Bo Xu
VLM
41
0
0
09 Apr 2025
PaMi-VDPO: Mitigating Video Hallucinations by Prompt-Aware Multi-Instance Video Preference Learning
Xinpeng Ding
K. Zhang
Jinahua Han
Lanqing Hong
Hang Xu
X. Li
MLLM
VLM
66
0
0
08 Apr 2025
LEO-MINI: An Efficient Multimodal Large Language Model using Conditional Token Reduction and Mixture of Multi-Modal Experts
Yimu Wang
Mozhgan Nasr Azadani
Sean Sedwards
Krzysztof Czarnecki
MLLM
MoE
47
0
0
07 Apr 2025
Advancing Egocentric Video Question Answering with Multimodal Large Language Models
Alkesh Patel
Vibhav Chitalia
Yinfei Yang
23
0
0
06 Apr 2025
VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT
Zhuo Zhi
Qiangqiang Wu
Minghe Shen
W. J. Li
Yinchuan Li
Kun Shao
Kaiwen Zhou
LLMAG
28
0
0
06 Apr 2025
Safety Modulation: Enhancing Safety in Reinforcement Learning through Cost-Modulated Rewards
Hanping Zhang
Yuhong Guo
OffRL
26
0
0
03 Apr 2025
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Chuanqi Cheng
Jian-Yu Guan
Wei Yu Wu
Rui Yan
VLM
40
0
0
03 Apr 2025
Aligned Better, Listen Better for Audio-Visual Large Language Models
Yuxin Guo
Shuailei Ma
Shijie Ma
Xiaoyi Bao
Chen-Wei Xie
Kecheng Zheng
Tingyu Weng
Siyang Sun
Yun Zheng
Wei Zou
MLLM
AuLLM
58
2
0
02 Apr 2025
TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding
Junwen Pan
Rui Zhang
Xin Wan
Yuan Zhang
Ming Lu
Qi She
VLM
36
1
0
02 Apr 2025
Q-Adapt: Adapting LMM for Visual Quality Assessment with Progressive Instruction Tuning
Yiting Lu
X. Li
H. Wu
Bingchen Li
Weisi Lin
Zhibo Chen
37
1
0
02 Apr 2025
AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference
Kai Huang
Hao Zou
Bochen Wang
Ye Xi
Zhen Xie
Hao Wang
VLM
37
0
0
31 Mar 2025
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
Y. Wang
Y. Wang
Bo Chen
Tong Wu
Dongyan Zhao
Zilong Zheng
VLM
MLLM
49
1
0
29 Mar 2025
Online Reasoning Video Segmentation with Just-in-Time Digital Twins
Yiqing Shen
Bohan Liu
Chenjia Li
Lalithkumar Seenivasan
Mathias Unberath
VOS
67
2
0
27 Mar 2025
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
Shuming Liu
Chen Zhao
Tianqi Xu
Bernard Ghanem
VLM
69
0
0
27 Mar 2025
Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model
Abdelrahman M. Shaker
Muhammad Maaz
Chenhui Gou
Hamid Rezatofighi
Salman Khan
F. Khan
44
0
0
27 Mar 2025
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng
Kaixiong Gong
B. Li
Zonghao Guo
Yibing Wang
Tianshuo Peng
Benyou Wang
Xiangyu Yue
SyDa
AI4TS
LRM
46
13
0
27 Mar 2025
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
Yucheng Suo
Fan Ma
Linchao Zhu
T. Wang
Fengyun Rao
Yi Yang
LRM
70
0
0
26 Mar 2025
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
Hongcheng Gao
Jiashu Qu
Jingyi Tang
Baolong Bi
Y. Liu
Hongyu Chen
Li Liang
Li Su
Qingming Huang
MLLM
VLM
LRM
72
3
0
25 Mar 2025
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
Xiangrui Liu
Yan Shu
Zheng Liu
Ao Li
Yang Tian
Bo Zhao
VGen
VLM
86
0
0
24 Mar 2025
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
Cheng Yang
Yang Sui
Jinqi Xiao
Lingyi Huang
Yu Gong
...
Jinghua Yan
Y. Bai
P. Sadayappan
Xia Hu
Bo Yuan
VLM
49
0
0
24 Mar 2025
Breaking the Encoder Barrier for Seamless Video-Language Understanding
Handong Li
Yiyuan Zhang
Longteng Guo
Xiangyu Yue
Jing Liu
VLM
67
0
0
24 Mar 2025
PVChat: Personalized Video Chat with One-Shot Learning
Yufei Shi
Weilong Yan
Gang Xu
Yumeng Li
Y. Li
Z. Li
Fei Richard Yu
Ming Li
Si Yong Yeo
38
0
0
21 Mar 2025
MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations
Kyungho Bae
Jinhyung Kim
Sihaeng Lee
Soonyoung Lee
G. Lee
Jinwoo Choi
62
1
0
20 Mar 2025
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
Keda Tao
Haoxuan You
Yang Sui
Can Qin
H. Wang
VLM
MQ
79
0
0
20 Mar 2025
GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation
Junyu Shi
Lijiang Liu
Yong Sun
Zhiyuan Zhang
Jinni Zhou
Qiang Nie
50
0
0
19 Mar 2025
Efficient Motion-Aware Video MLLM
Zijia Zhao
Yuqi Huo
Tongtian Yue
Longteng Guo
Haoyu Lu
B. Wang
Weipeng Chen
J. Liu
50
0
0
17 Mar 2025
Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding
Weiyu Guo
Ziyang Chen
Shaoguang Wang
JianXiang He
Yijie Xu
Jinhui Ye
Ying Sun
Hui Xiong
42
1
0
17 Mar 2025
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
Tianyuan Qu
Longxiang Tang
Bohao Peng
Senqiao Yang
Bei Yu
Jiaya Jia
VLM
57
0
0
16 Mar 2025
Similarity-Aware Token Pruning: Your VLM but Faster
Ahmadreza Jeddi
Negin Baghbanzadeh
Elham Dolatabadi
Babak Taati
3DV
VLM
50
1
0
14 Mar 2025
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Boyu Chen
Zhengrong Yue
Siran Chen
Z. Wang
Yang Liu
Peng Li
Y. Wang
VLM
61
0
0
13 Mar 2025
TIME: Temporal-sensitive Multi-dimensional Instruction Tuning and Benchmarking for Video-LLMs
Yunxiao Wang
Meng Liu
Rui Shao
Haoyu Zhang
Bin Wen
Fan Yang
Tingting Gao
Di Zhang
Liqiang Nie
54
1
0
13 Mar 2025
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Md. Mohaiminul Islam
Tushar Nagarajan
Huiyu Wang
Gedas Bertasius
Lorenzo Torresani
50
0
0
12 Mar 2025