ResearchTrend.AI

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

5 June 2023
Hang Zhang
Xin Li
Lidong Bing
    MLLM

Papers citing "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding"

50 / 697 papers shown
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines
Dongzhi Jiang
Renrui Zhang
Ziyu Guo
Yanmin Wu
Jiayi Lei
...
Guanglu Song
Peng Gao
Yu Liu
Chunyuan Li
Hongsheng Li
MLLM
27
16
0
19 Sep 2024
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Shengsheng Qian
Zuyi Zhou
Dizhan Xue
Bing Wang
Changsheng Xu
LRM
34
1
0
19 Sep 2024
Uncertainty-Guided Self-Questioning and Answering for Video-Language Alignment
Jin Chen
Kaijing Ma
Haojian Huang
Jiayu Shen
Han Fang
Xianghao Zang
Chao Ban
79
2
0
17 Sep 2024
Generating Event-oriented Attribution for Movies via Two-Stage Prefix-Enhanced Multimodal LLM
Yuanjie Lyu
Tong Bill Xu
Zihan Niu
Bo Peng
Jing Ke
Enhong Chen
23
0
0
14 Sep 2024
PiTe: Pixel-Temporal Alignment for Large Video-Language Model
Yang Liu
Pengxiang Ding
Siteng Huang
Min Zhang
H. Zhao
Donglin Wang
24
5
0
11 Sep 2024
1M-Deepfakes Detection Challenge
Zhixi Cai
Abhinav Dhall
Shreya Ghosh
Munawar Hayat
D. Kollias
Kalin Stefanov
Usman Tariq
26
1
0
11 Sep 2024
Question-Answering Dense Video Events
Hangyu Qin
Junbin Xiao
Angela Yao
VLM
71
1
0
06 Sep 2024
StimuVAR: Spatiotemporal Stimuli-aware Video Affective Reasoning with Multimodal Large Language Models
Y. Guo
Faizan Siddiqui
Yang Zhao
Rama Chellappa
Shao-Yuan Lo
LRM
24
2
0
31 Aug 2024
HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics
Gueter Josmy Faure
Jia-Fong Yeh
Min-Hung Chen
Hung-Ting Su
Winston H. Hsu
Shang-Hong Lai
26
3
0
30 Aug 2024
A longitudinal sentiment analysis of Sinophobia during COVID-19 using large language models
Chen Wang
Rohitash Chandra
21
0
0
29 Aug 2024
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
Shiwei Wu
Joya Chen
Kevin Qinghong Lin
Qimeng Wang
Yan Gao
Qianli Xu
Tong Bill Xu
Yao Hu
Enhong Chen
Mike Zheng Shou
VLM
45
12
0
29 Aug 2024
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Zhifei Xie
Changqiao Wu
AuLLM
VGen
VLM
SyDa
LRM
29
55
0
29 Aug 2024
CogVLM2: Visual Language Models for Image and Video Understanding
Wenyi Hong
Weihan Wang
Ming Ding
Wenmeng Yu
Qingsong Lv
...
Debing Liu
Bin Xu
Juanzi Li
Yuxiao Dong
Jie Tang
VLM
MLLM
45
88
0
29 Aug 2024
Training-free Video Temporal Grounding using Large-scale Pre-trained Models
Minghang Zheng
Xinhao Cai
Qingchao Chen
Yuxin Peng
Yang Liu
32
4
0
29 Aug 2024
Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos
Jiajun Fei
Dian Li
Zhidong Deng
Zekun Wang
Gang Liu
Hui Wang
VLM
35
34
0
26 Aug 2024
LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models
Qihang Ge
Wei Sun
Yu Zhang
Yunhao Li
Zhongpeng Ji
Fengyu Sun
Shangling Jui
Xiongkuo Min
Guangtao Zhai
41
4
0
26 Aug 2024
T3M: Text Guided 3D Human Motion Synthesis from Speech
Wenshuo Peng
Kaipeng Zhang
Sai Qian Zhang
20
0
0
23 Aug 2024
MuMA-ToM: Multi-modal Multi-Agent Theory of Mind
Haojun Shi
Suyu Ye
Xinyu Fang
Chuanyang Jin
Leyla Isik
Yen-Ling Kuo
Tianmin Shu
LLMAG
63
7
0
22 Aug 2024
CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion
Yunlong Tang
Gen Zhan
Li Yang
Yiting Liao
Chenliang Xu
VGen
DiffM
LRM
37
8
0
21 Aug 2024
EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning
Bohao Xing
Zitong Yu
Xin Liu
Kaishen Yuan
Qilang Ye
Weicheng Xie
Huanjing Yue
Jingyu Yang
H. Kalviainen
48
10
0
21 Aug 2024
Video Emotion Open-vocabulary Recognition Based on Multimodal Large Language Model
Mengying Ge
Dongkai Tang
Mingyang Li
VLM
17
1
0
21 Aug 2024
SZTU-CMU at MER2024: Improving Emotion-LLaMA with Conv-Attention for Multimodal Emotion Recognition
Zebang Cheng
Shuyuan Tu
Dawei Huang
Minghan Li
Xiaojiang Peng
Zhi-Qi Cheng
Alexander G. Hauptmann
43
2
0
20 Aug 2024
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang
Jiayan Teng
Wendi Zheng
Ming Ding
Shiyu Huang
...
Weihan Wang
Yean Cheng
Xiaotao Gu
Yuxiao Dong
Jie Tang
DiffM
VGen
72
389
0
12 Aug 2024
Egocentric Vision Language Planning
Zhirui Fang
Ming Yang
Weishuai Zeng
Boyu Li
Junpeng Yue
Ziluo Ding
Xiu Li
Zongqing Lu
LM&Ro
34
1
0
11 Aug 2024
VideoQA in the Era of LLMs: An Empirical Study
Junbin Xiao
Nanxin Huang
Hangyu Qin
Dongyang Li
Yicong Li
...
Zhulin Tao
Jianxing Yu
Liang Lin
Tat-Seng Chua
Angela Yao
23
10
0
08 Aug 2024
OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents
Qiang Sun
Yuanyi Luo
Sirui Li
Wenxiao Zhang
Wei Liu
AuLLM
LLMAG
VLM
23
2
0
06 Aug 2024
Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics
Shishira R. Maiya
Anubhav Gupta
M. Gwilliam
Max Ehrlich
Abhinav Shrivastava
33
3
1
05 Aug 2024
UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model
Zhaowei Li
Wei Wang
Yiqing Cai
Xu Qi
Pengyu Wang
Dong Zhang
Hang Song
Botian Jiang
Zhida Huang
Tao Wang
AIFin
LRM
40
3
0
05 Aug 2024
Infusing Environmental Captions for Long-Form Video Language Grounding
Hyogun Lee
Soyeon Hong
Mujeen Sung
Jinwoo Choi
33
0
0
05 Aug 2024
User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance
Mrinal Verghese
Brian Chen
H. Eghbalzadeh
Tushar Nagarajan
Ruta Desai
LRM
45
1
0
04 Aug 2024
Multi-Frame Vision-Language Model for Long-form Reasoning in Driver Behavior Analysis
Hiroshi Takato
Hiroshi Tsutsui
Komei Soda
Hidetaka Kamigaito
VLM
26
0
0
03 Aug 2024
A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks
Jiaqi Wang
Hanqi Jiang
Yi-Hsueh Liu
Chong Ma
Xu-Yao Zhang
...
Xin Zhang
Wei Zhang
Dinggang Shen
Tianming Liu
Shu Zhang
VLM
AI4TS
42
30
0
02 Aug 2024
SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data
Yichen Lu
Álvaro Huertas-García
Xuankai Chang
Hengwei Bian
Soumi Maiti
Shinji Watanabe
37
2
0
01 Aug 2024
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
Xiaowei Chi
Yatian Wang
Aosong Cheng
Pengjun Fang
Zeyue Tian
...
Wenhan Luo
Qifeng Chen
Shanghang Zhang
Qi-fei Liu
Yi-Ting Guo
67
7
0
30 Jul 2024
CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models
Junda Wu
Xintong Li
Tong Yu
Yu-Xiang Wang
Xiang Chen
Jiuxiang Gu
Lina Yao
Jingbo Shang
Julian McAuley
37
0
0
29 Jul 2024
EPD: Long-term Memory Extraction, Context-awared Planning and Multi-iteration Decision @ EgoPlan Challenge ICML 2024
Letian Shi
Qi Lv
Xiang Deng
Liqiang Nie
23
1
0
28 Jul 2024
LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models
Ruiyi Zhang
Yufan Zhou
Jian Chen
Jiuxiang Gu
Changyou Chen
Tongfei Sun
VLM
34
6
0
27 Jul 2024
MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models
Siwei Wu
Kang Zhu
Yu Bai
Yiming Liang
Yizhi Li
...
Xingwei Qu
Xuxin Cheng
Ge Zhang
Wenhao Huang
Chenghua Lin
VLM
24
2
0
24 Jul 2024
MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Micro-Expression Dynamics in Video Dialogues
Liyun Zhang
33
1
0
23 Jul 2024
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Mingze Xu
Mingfei Gao
Zhe Gan
Hong-You Chen
Zhengfeng Lai
Haiming Gang
Kai Kang
Afshin Dehghan
48
48
0
22 Jul 2024
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
Haoning Wu
Dongxu Li
Bei Chen
Junnan Li
33
105
0
22 Jul 2024
WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding
Quan Kong
Yuki Kawana
Rajat Saini
Ashutosh Kumar
Jingjing Pan
...
Yohei Ozao
Balázs Opra
D. Anastasiu
Yoichi Sato
N. Kobori
VGen
27
7
0
22 Jul 2024
DOPRA: Decoding Over-accumulation Penalization and Re-allocation in Specific Weighting Layer
Jinfeng Wei
Xiaofeng Zhang
21
12
0
21 Jul 2024
Navigation Instruction Generation with BEV Perception and Large Language Models
Sheng Fan
Rui Liu
Wenguan Wang
Yi Yang
40
5
0
21 Jul 2024
End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling
Jianxin Liang
Xiaojun Meng
Yueqian Wang
Chang Liu
Qun Liu
Dongyan Zhao
27
5
0
21 Jul 2024
Audio-visual training for improved grounding in video-text LLMs
Shivprasad Sagare
Hemachandran S
Kinshuk Sarabhai
Prashant Ullegaddi
SA Rajeshkumar
27
0
0
21 Jul 2024
A Comprehensive Review of Few-shot Action Recognition
Yuyang Wanyan
Xiaoshan Yang
Weiming Dong
Changsheng Xu
VLM
61
3
0
20 Jul 2024
On Pre-training of Multimodal Language Models Customized for Chart Understanding
Wan-Cyuan Fan
Yen-Chun Chen
Mengchen Liu
Lu Yuan
Leonid Sigal
36
5
0
19 Jul 2024
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Kirolos Ataallah
Xiaoqian Shen
Eslam Abdelrahman
Essam Sleiman
Mingchen Zhuge
Jian Ding
Deyao Zhu
Jürgen Schmidhuber
Mohamed Elhoseiny
VLM
17
17
0
17 Jul 2024
F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
Jie-jin Yang
Xuesong Niu
Nan Jiang
Ruimao Zhang
Siyuan Huang
30
9
0
17 Jul 2024