ResearchTrend.AI

VideoChat: Chat-Centric Video Understanding

10 May 2023
Kunchang Li
Yinan He
Yi Wang
Yizhuo Li
Wen Wang
Ping Luo
Yali Wang
Limin Wang
Yu Qiao
    MLLM
ArXiv (abs) · PDF · HTML · HuggingFace (3 upvotes) · GitHub (3246★)

Papers citing "VideoChat: Chat-Centric Video Understanding"

50 / 555 papers shown
ReWind: Understanding Long Videos with Instructed Learnable Memory
Computer Vision and Pattern Recognition (CVPR), 2024
Anxhelo Diko
Tinghuai Wang
Wassim Swaileh
Shiyan Sun
Ioannis Patras
KELM, VLM
299
4
0
23 Nov 2024
ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
Computer Vision and Pattern Recognition (CVPR), 2024
Tanveer Hannan
Md. Mohaiminul Islam
Jindong Gu
Thomas Seidl
Gedas Bertasius
VLM
170
9
0
22 Nov 2024
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
Computer Vision and Pattern Recognition (CVPR), 2024
Songhao Han
Wei Huang
Hairong Shi
Le Zhuo
Xiu Su
Shifeng Zhang
Xu Zhou
Xiaojuan Qi
Yue Liao
Si Liu
VGen, LRM
230
45
0
22 Nov 2024
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
Yongdong Luo
Xiawu Zheng
Guilin Li
Haojia Lin
...
Jinfa Huang
Jiayi Ji
Jiebo Luo
Rongrong Ji
VLM
538
67
0
20 Nov 2024
On the Consistency of Video Large Language Models in Temporal Comprehension
Computer Vision and Pattern Recognition (CVPR), 2024
Minjoon Jung
Junbin Xiao
Byoung-Tak Zhang
Angela Yao
398
5
0
20 Nov 2024
Generative Timelines for Instructed Visual Assembly
Alejandro Pardo
Jui-hsien Wang
Guohao Li
Josef Sivic
Bryan C. Russell
Fabian Caba Heilbron
VGen
207
0
0
19 Nov 2024
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models
Tingyu Qu
Mingxiao Li
Tinne Tuytelaars
Marie-Francine Moens
VLM
224
3
0
17 Nov 2024
Spider: Any-to-Many Multimodal LLM
Jinxiang Lai
Jie Zhang
Jun Liu
Jian Li
Xiaocheng Lu
Song Guo
MLLM
440
4
0
14 Nov 2024
VideoCogQA: A Controllable Benchmark for Evaluating Cognitive Abilities in Video-Language Models
Chenglin Li
Qianglong Chen
Zhi Li
Feng Tao
Yin Zhang
350
0
0
14 Nov 2024
Multimodal Instruction Tuning with Hybrid State Space Models
Jianing Zhou
Han Li
Shuai Zhang
Ning Xie
Ruijie Wang
Xiaohan Nie
Sheng Liu
Lingyun Wang
205
0
0
13 Nov 2024
Artificial Intelligence for Biomedical Video Generation
Linyuan Li
Jianing Qiu
Anujit Saha
Lin Li
Poyuan Li
Mengxian He
Ziyu Guo
Wu Yuan
VGen
334
2
0
12 Nov 2024
EVQAScore: A Fine-grained Metric for Video Question Answering Data Quality Evaluation
Hao Liang
Zirong Chen
Feiyu Xiong
Wentao Zhang
226
2
0
11 Nov 2024
StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification
Yichen He
Yuan Lin
Jianchao Wu
Hanchong Zhang
Yuchen Zhang
Ruicheng Le
VGen, VLM
654
4
0
11 Nov 2024
HourVideo: 1-Hour Video-Language Understanding
Neural Information Processing Systems (NeurIPS), 2024
Keshigeyan Chandrasegaran
Agrim Gupta
Lea M. Hadzic
Taran Kota
Jimming He
Cristobal Eyzaguirre
Zane Durante
Pengfei Yu
Jiajun Wu
L. Fei-Fei
VLM
213
82
0
07 Nov 2024
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
Computer Vision and Pattern Recognition (CVPR), 2024
Shehan Munasinghe
Hanan Gani
Wenqi Zhu
Jiale Cao
Eric P. Xing
Fahad Shahbaz Khan
Salman Khan
MLLM, VGen, VLM
375
28
0
07 Nov 2024
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
Ruyang Liu
Haoran Tang
Haibo Liu
Yixiao Ge
Mingyu Ding
Chen Li
Jiankun Yang
VLM
170
17
0
04 Nov 2024
LLaMo: Large Language Model-based Molecular Graph Assistant
Neural Information Processing Systems (NeurIPS), 2024
Jinyoung Park
Minseong Bae
Dohwan Ko
Hyunwoo J. Kim
199
15
0
31 Oct 2024
MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding
Yuan Wang
Di Huang
Yaqi Zhang
Wanli Ouyang
J. Jiao
Xuetao Feng
Yan Zhou
Pengfei Wan
Weizhen He
Dan Xu
VGen
185
35
0
29 Oct 2024
VLMimic: Vision Language Models are Visual Imitation Learner for Fine-grained Actions
Neural Information Processing Systems (NeurIPS), 2024
Guanyan Chen
Ming Wang
Te Cui
Yao Mu
Haizhou Li
...
Haoyang Lu
Guangyan Chen
Yuchen Ren
Yi Yang
Yufeng Yue
VLM
253
11
0
28 Oct 2024
FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for Multi-Modal Tobacco Content Analysis
N. V. R. Chappa
P. Dobbs
Bhiksha Raj
Khoa Luu
266
3
0
25 Oct 2024
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
International Conference on Learning Representations (ICLR), 2024
Xiangyu Zeng
Kunchang Li
Chenting Wang
Xinhao Li
Tianxiang Jiang
...
Zhengrong Yue
Yi Wang
Yali Wang
Yu Qiao
Limin Wang
MLLM, VLM, AI4TS
227
52
0
25 Oct 2024
Foundation Models for Rapid Autonomy Validation
IEEE International Conference on Robotics and Automation (ICRA), 2024
Alec Farid
Peter Schleede
Aaron Huang
Christoffer Heckman
290
0
0
22 Oct 2024
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Michael S Ryoo
Honglu Zhou
Shrikant B. Kendre
Can Qin
Le Xue
...
Kanchana Ranasinghe
Caiming Xiong
Ran Xu
Juan Carlos Niebles
VGen
250
25
0
21 Oct 2024
FIOVA: A Multi-Annotator Benchmark for Human-Aligned Video Captioning
Shiyu Hu
Xuchen Li
Xuzhao Li
Jing Zhang
Yipei Wang
Xin Zhao
Kang Hao Cheong
VLM
223
3
0
20 Oct 2024
TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
Neural Information Processing Systems (NeurIPS), 2024
Yiwei Guo
Shaobin Zhuang
Kunchang Li
Yu Qiao
Yali Wang
VLM, CLIP
337
5
0
16 Oct 2024
OMCAT: Omni Context Aware Transformer
Arushi Goel
Karan Sapra
Matthieu Le
Rafael Valle
Andrew Tao
Bryan Catanzaro
MLLM, VLM
188
2
0
15 Oct 2024
LocoMotion: Learning Motion-Focused Video-Language Representations
Asian Conference on Computer Vision (ACCV), 2024
Hazel Doughty
Fida Mohammad Thoker
Cees G. M. Snoek
325
3
0
15 Oct 2024
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI
Sijie Cheng
Kechen Fang
Yangyang Yu
Sicheng Zhou
Yangqiu Song
Ye Tian
Tingguang Li
Lei Han
Yang Liu
193
14
0
15 Oct 2024
Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs
Kai Han
Jianyuan Guo
Yehui Tang
W. He
Enhua Wu
Yunhe Wang
MLLM, VLM
173
16
0
14 Oct 2024
Depth Any Video with Scalable Synthetic Data
International Conference on Learning Representations (ICLR), 2024
Honghui Yang
Di Huang
Wei Yin
Chunhua Shen
Haifeng Liu
Xiaofei He
Binbin Lin
Wanli Ouyang
Tong He
VGen, MDE
270
16
0
14 Oct 2024
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Computer Vision and Pattern Recognition (CVPR), 2024
Gen Luo
Xue Yang
Wenhan Dou
Zhaokai Wang
Jifeng Dai
Yu Qiao
Xizhou Zhu
VLM, MLLM
321
64
0
10 Oct 2024
Temporal Reasoning Transfer from Text to Video
International Conference on Learning Representations (ICLR), 2024
Lei Li
Yuanxin Liu
Linli Yao
Peiyuan Zhang
Chenxin An
Lean Wang
Xu Sun
Dianbo Sui
Qi Liu
LRM
143
20
0
08 Oct 2024
TRACE: Temporal Grounding Video LLM via Causal Event Modeling
International Conference on Learning Representations (ICLR), 2024
Yongxin Guo
Jingyu Liu
Mingda Li
Xiaoying Tang
Qingbin Liu
221
43
0
08 Oct 2024
Human-in-the-loop Reasoning For Traffic Sign Detection: Collaborative Approach Yolo With Video-llava
Mehdi Azarafza
Fatima Idrees
Ali Ehteshami Bejnordi
Charles Steinmetz
Stefan Henkler
A. Rettberg
253
1
0
07 Oct 2024
Realizing Video Summarization from the Path of Language-based Semantic Understanding
Kuan-Chen Mu
Zhi-Yi Chin
Wei-Chen Chiu
109
0
0
06 Oct 2024
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
Haibo Wang
Zhiyang Xu
Yu Cheng
Shizhe Diao
Jiuxiang Gu
Yixin Cao
Qifan Wang
Weifeng Ge
Lifu Huang
210
50
0
04 Oct 2024
Frame-Voyager: Learning to Query Frames for Video Large Language Models
International Conference on Learning Representations (ICLR), 2024
Sicheng Yu
Chengkai Jin
Huanyu Wang
Zhenghao Chen
Sheng Jin
...
Zhenbang Sun
Bingni Zhang
Jiawei Wu
Hao Zhang
Qianru Sun
275
35
0
04 Oct 2024
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
International Conference on Learning Representations (ICLR), 2024
Wenhao Chai
Enxin Song
Y. Du
Chenlin Meng
Vashisht Madhavan
Omer Bar-Tal
Jeng-Neng Hwang
Saining Xie
Christopher D. Manning
3DV
517
87
0
04 Oct 2024
Open-vocabulary Multimodal Emotion Recognition: Dataset, Metric, and Benchmark
Zheng Lian
Haiyang Sun
Guoying Zhao
Lan Chen
Haoyu Chen
...
Rui Liu
Shan Liang
Ya Li
Jiangyan Yi
Jianhua Tao
VLM
257
6
0
02 Oct 2024
UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Hasnat Md Abdullah
Tian Liu
Kangda Wei
Shu Kong
Ruihong Huang
223
5
0
02 Oct 2024
EMMA: Efficient Visual Alignment in Multi-Modal LLMs
Sara Ghazanfari
Alexandre Araujo
Prashanth Krishnamurthy
Siddharth Garg
Farshad Khorrami
VLM
231
7
0
02 Oct 2024
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Haotian Zhang
Mingfei Gao
Zhe Gan
Philipp Dufter
Nina Wenzel
...
Haoxuan You
Zirui Wang
Afshin Dehghan
Peter Grasch
Yinfei Yang
VLM, MLLM
263
64
1
30 Sep 2024
Efficient Driving Behavior Narration and Reasoning on Edge Device Using Large Language Models
IEEE Transactions on Vehicular Technology (IEEE Trans. Veh. Technol.), 2024
Yizhou Huang
Yihua Cheng
Kezhi Wang
LRM
128
3
0
30 Sep 2024
Visual Context Window Extension: A New Perspective for Long Video Understanding
Hongchen Wei
Zhenzhong Chen
VLM
258
1
0
30 Sep 2024
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
Neural Information Processing Systems (NeurIPS), 2024
Zechen Bai
Tong He
Haiyang Mei
Pichao Wang
Ziteng Gao
Joya Chen
Lei Liu
Zheng Zhang
Mike Zheng Shou
VLM, VOS, MLLM
203
67
0
29 Sep 2024
Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Xiao Wang
Yue Yu
Zijia Lin
Fuzheng Zhang
Di Zhang
Liqiang Nie
VGen
151
5
0
29 Sep 2024
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding
Neural Information Processing Systems (NeurIPS), 2024
Ye Liu
Zongyang Ma
Chen Ma
Yang Wu
Ying Shan
Chang Wen Chen
199
47
0
26 Sep 2024
LLM4Brain: Training a Large Language Model for Brain Video Understanding
Ruizhe Zheng
Lichao Sun
129
2
0
26 Sep 2024
MIO: A Foundation Model on Multimodal Tokens
Zekun Wang
King Zhu
Chunpu Xu
Wangchunshu Zhou
Jiaheng Liu
...
Yuanxing Zhang
Ge Zhang
Ke Xu
Jie Fu
Wenhao Huang
MLLM, AuLLM
314
19
0
26 Sep 2024
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness
Chenming Zhu
Tai Wang
Wenwei Zhang
Jiangmiao Pang
Xihui Liu
540
107
0
26 Sep 2024