Communities
Connect sessions
AI calendar
Organizations
Join Slack
Contact Sales
Search
Open menu
Home
Papers
2412.02071
Cited By
v1
v2 (latest)
Progress-Aware Video Frame Captioning
Computer Vision and Pattern Recognition (CVPR), 2024
3 December 2024
Zihui Xue
Joungbin An
Xitong Yang
Kristen Grauman
Re-assign community
ArXiv (abs)
PDF
HTML
Github
Papers citing
"Progress-Aware Video Frame Captioning"
50 / 69 papers shown
AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration
Xinlong Chen
Yue Ding
Weihong Lin
Jingyun Hua
Linli Yao
...
Yuanxing Zhang
Qiang Liu
Pengfei Wan
Liang Wang
Tieniu Tan
294
8
0
12 Oct 2025
It's Just Another Day: Unique Video Captioning by Discriminative Prompting
Asian Conference on Computer Vision (ACCV), 2024
Toby Perrett
Tengda Han
Dima Damen
Andrew Zisserman
289
3
0
15 Oct 2024
When Does Perceptual Alignment Benefit Vision Representations?
Neural Information Processing Systems (NeurIPS), 2024
Shobhita Sundaram
Stephanie Fu
Lukas Muttenthaler
Netanel Y. Tamir
Lucy Chai
Simon Kornblith
Trevor Darrell
Phillip Isola
354
56
1
14 Oct 2024
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
International Conference on Learning Representations (ICLR), 2024
Wenhao Chai
Enxin Song
Y. Du
Chenlin Meng
Vashisht Madhavan
Omer Bar-Tal
Jeng-Neng Hwang
Saining Xie
Christopher D. Manning
3DV
827
118
0
04 Oct 2024
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang
Jinming Wu
W. Li
Bo Li
Zejun Ma
Ziwei Liu
Chunyuan Li
SyDa
VGen
584
248
0
03 Oct 2024
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
Computer Vision and Pattern Recognition (CVPR), 2024
Qirui Jiao
Daoyuan Chen
Yilun Huang
Yaliang Li
Ying Shen
VLM
338
17
0
08 Aug 2024
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li
Yuanhan Zhang
Dong Guo
Renrui Zhang
Feng Li
Hao Zhang
Kaichen Zhang
Yanwei Li
Ziwei Liu
Chunyuan Li
MLLM
SyDa
VLM
797
2,258
0
06 Aug 2024
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Feng Li
Renrui Zhang
Hao Zhang
Yuanhan Zhang
Bo Li
Wei Li
Zejun Ma
Chunyuan Li
MLLM
VLM
516
534
0
10 Jul 2024
Tarsier: Recipes for Training and Evaluating Large Video Description Models
Jiawei Wang
Liping Yuan
Yuchen Zhang
351
135
0
30 Jun 2024
ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos
Jr-Jen Chen
Yu-Chien Liao
Hsi-Che Lin
Yu-Chu Yu
Yen-Chun Chen
Yu-Chiang Frank Wang
438
50
0
27 Jun 2024
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models
Yuxuan Wang
Yueqian Wang
Dongyan Zhao
Cihang Xie
Zilong Zheng
MLLM
VLM
324
66
0
24 Jun 2024
NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative
International Conference on Learning Representations (ICLR), 2024
Asmar Nadeem
Faegheh Sardari
R. Dawes
Syed Sameed Husain
Adrian Hilton
Armin Mustafa
515
10
0
10 Jun 2024
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Neural Information Processing Systems (NeurIPS), 2024
Lin Chen
Xilin Wei
Jinsong Li
Xiaoyi Dong
Pan Zhang
...
Li Yuan
Yu Qiao
Dahua Lin
Feng Zhao
Jiaqi Wang
440
383
0
06 Jun 2024
What matters when building vision-language models?
Neural Information Processing Systems (NeurIPS), 2024
Hugo Laurençon
Léo Tronchon
Matthieu Cord
Victor Sanh
VLM
489
320
0
03 May 2024
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Lin Xu
Yilin Zhao
Daquan Zhou
Zhijie Lin
See Kiong Ng
Jiashi Feng
MLLM
VLM
368
324
0
25 Apr 2024
AutoAD III: The Prequel -- Back to the Pixels
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
DiffM
446
40
0
22 Apr 2024
V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning
Hang Hua
Yunlong Tang
Chenliang Xu
Jiebo Luo
VGen
468
58
0
18 Apr 2024
Streaming Dense Video Captioning
Xingyi Zhou
Anurag Arnab
Shyamal Buch
Shen Yan
Austin Myers
Xuehan Xiong
Arsha Nagrani
Cordelia Schmid
VLM
292
87
0
01 Apr 2024
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Xiaohan Wang
Yuhui Zhang
Orr Zohar
Serena Yeung-Levy
VLM
502
283
0
15 Mar 2024
TempCompass: Do Video LLMs Really Understand Videos?
Yuanxin Liu
Shicheng Li
Yi Liu
Yuxiang Wang
Shuhuai Ren
Lei Li
Sishuo Chen
Xu Sun
Lu Hou
VLM
518
272
0
01 Mar 2024
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Tsai-Shien Chen
Aliaksandr Siarohin
Willi Menapace
Ekaterina Deyneka
Hsiang-wei Chao
...
Yuwei Fang
Hsin-Ying Lee
Jian Ren
Ming-Hsuan Yang
Sergey Tulyakov
VGen
491
387
0
29 Feb 2024
OSCaR: Object State Captioning and State Change Representation
Nguyen Nguyen
Jing Bi
Ali Vosoughi
Yapeng Tian
Pooyan Fazli
Chenliang Xu
682
16
0
27 Feb 2024
Video ReCap: Recursive Captioning of Hour-Long Videos
Md. Mohaiminul Islam
Ngan Ho
Xitong Yang
Tushar Nagarajan
Lorenzo Torresani
Gedas Bertasius
VGen
VLM
793
93
0
20 Feb 2024
Learning Object State Changes in Videos: An Open-World Perspective
Zihui Xue
Kumar Ashutosh
Kristen Grauman
VGen
391
37
0
19 Dec 2023
VILA: On Pre-training for Visual Language Models
Computer Vision and Pattern Recognition (CVPR), 2023
Ji Lin
Hongxu Yin
Ming-Yu Liu
Yao Lu
Pavlo Molchanov
Andrew Tao
Huizi Mao
Jan Kautz
Mohammad Shoeybi
Song Han
MLLM
VLM
787
778
0
12 Dec 2023
Describing Differences in Image Sets with Natural Language
Computer Vision and Pattern Recognition (CVPR), 2023
Lisa Dunlap
Yuhui Zhang
Xiaohan Wang
Ruiqi Zhong
Trevor Darrell
Jacob Steinhardt
Joseph E. Gonzalez
Serena Yeung-Levy
CoGe
VLM
526
58
0
05 Dec 2023
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2023
Shuhuai Ren
Linli Yao
Shicheng Li
Xu Sun
Lu Hou
VLM
MLLM
489
412
0
04 Dec 2023
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
European Conference on Computer Vision (ECCV), 2023
Shicheng Li
Lei Li
Shuhuai Ren
Yuanxin Liu
Yi Liu
Rundong Gao
Xu Sun
Lu Hou
275
51
0
29 Nov 2023
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
Zhiyuan Zhao
Sijin Yu
Linke Ouyang
Xiao-wen Dong
Yuan Liu
Conghui He
MLLM
VLM
484
217
0
28 Nov 2023
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
Computer Vision and Pattern Recognition (CVPR), 2023
Tianrui Guan
Fuxiao Liu
Xiyang Wu
Ruiqi Xian
Zongxia Li
...
Lichang Chen
Furong Huang
Yaser Yacoob
Dinesh Manocha
Wanrong Zhu
VLM
MLLM
630
471
0
23 Oct 2023
AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description
IEEE International Conference on Computer Vision (ICCV), 2023
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
DiffM
324
54
0
10 Oct 2023
Improved Baselines with Visual Instruction Tuning
Computer Vision and Pattern Recognition (CVPR), 2023
Haotian Liu
Chunyuan Li
Yuheng Li
Yong Jae Lee
VLM
MLLM
749
4,820
0
05 Oct 2023
Unified Coarse-to-Fine Alignment for Video-Text Retrieval
IEEE International Conference on Computer Vision (ICCV), 2023
Ziyang Wang
Yi-Lin Sung
Feng Cheng
Gedas Bertasius
Joey Tianyi Zhou
469
89
0
18 Sep 2023
The Change You Want to See (Now in 3D)
Ragav Sachdeva
Andrew Zisserman
3DPC
VGen
369
19
0
21 Aug 2023
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
Neural Information Processing Systems (NeurIPS), 2023
K. Mangalam
Raiymbek Akshulakov
Jitendra Malik
497
594
0
17 Aug 2023
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Hang Zhang
Xin Li
Lidong Bing
MLLM
806
1,669
0
05 Jun 2023
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Neural Information Processing Systems (NeurIPS), 2023
Rafael Rafailov
Archit Sharma
E. Mitchell
Stefano Ermon
Christopher D. Manning
Chelsea Finn
ALM
1.1K
8,135
0
29 May 2023
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye
Haiyang Xu
Guohai Xu
Jiabo Ye
Ming Yan
...
Junfeng Tian
Qiang Qi
Ji Zhang
Feiyan Huang
Jingren Zhou
VLM
MLLM
1.2K
1,215
0
27 Apr 2023
A Review of Deep Learning for Video Captioning
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Moloud Abdar
Meenakshi Kollati
Swaraja Kuraparthi
Farhad Pourpanah
Daniel J. McDuff
...
Shuicheng Yan
Abduallah A. Mohamed
Abbas Khosravi
Xiaoshi Zhong
Fatih Porikli
3DV
281
48
0
22 Apr 2023
Visual Instruction Tuning
Neural Information Processing Systems (NeurIPS), 2023
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDa
VLM
MLLM
1.4K
8,828
0
17 Apr 2023
Verbs in Action: Improving verb understanding in video-language models
IEEE International Conference on Computer Vision (ICCV), 2023
Liliane Momeni
Mathilde Caron
Arsha Nagrani
Andrew Zisserman
Cordelia Schmid
547
89
0
13 Apr 2023
AutoAD: Movie Description in Context
Computer Vision and Pattern Recognition (CVPR), 2023
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
300
57
0
29 Mar 2023
Sigmoid Loss for Language Image Pre-Training
IEEE International Conference on Computer Vision (ICCV), 2023
Xiaohua Zhai
Basil Mustafa
Alexander Kolesnikov
Lucas Beyer
CLIP
VLM
2.6K
2,833
0
27 Mar 2023
GPT-4 Technical Report
OpenAI OpenAI
OpenAI Josh Achiam
Steven Adler
Sandhini Agarwal
Lama Ahmad
...
Shengjia Zhao
Tianhao Zheng
Juntang Zhuang
William Zhuk
Barret Zoph
LLMAG
MLLM
5.3K
23,506
0
15 Mar 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Computer Vision and Pattern Recognition (CVPR), 2023
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TS
VLM
589
364
0
27 Feb 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
International Conference on Machine Learning (ICML), 2023
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLM
MLLM
1.6K
7,623
0
30 Jan 2023
Make-A-Video: Text-to-Video Generation without Text-Video Data
International Conference on Learning Representations (ICLR), 2022
Uriel Singer
Adam Polyak
Thomas Hayes
Xiaoyue Yin
Jie An
...
Oron Ashual
Oran Gafni
Devi Parikh
Sonal Gupta
Yaniv Taigman
DiffM
VGen
411
1,961
0
29 Sep 2022
Revealing Single Frame Bias for Video-and-Language Learning
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Jie Lei
Tamara L. Berg
Joey Tianyi Zhou
291
151
0
07 Jun 2022
Revisiting the "Video" in Video-Language Understanding
Computer Vision and Pattern Recognition (CVPR), 2022
S. Buch
Cristobal Eyzaguirre
Adrien Gaidon
Jiajun Wu
L. Fei-Fei
Juan Carlos Niebles
255
215
0
03 Jun 2022
NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions
Computer Vision and Pattern Recognition (CVPR), 2021
Junbin Xiao
Xindi Shang
Angela Yao
Tat-Seng Chua
507
817
0
18 May 2021
1
2
Next
Page 1 of 2