Progress-Aware Video Frame Captioning

Computer Vision and Pattern Recognition (CVPR), 2024

3 December 2024

Papers citing "Progress-Aware Video Frame Captioning"

50 / 69 papers shown

AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration

...

294

12 Oct 2025

It's Just Another Day: Unique Video Captioning by Discriminative Prompting

Asian Conference on Computer Vision (ACCV), 2024

289

15 Oct 2024

When Does Perceptual Alignment Benefit Vision Representations?

Neural Information Processing Systems (NeurIPS), 2024

354

14 Oct 2024

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

International Conference on Learning Representations (ICLR), 2024

Christopher D. Manning

3DV

827

118

04 Oct 2024

LLaVA-Video: Video Instruction Tuning With Synthetic Data

584

248

03 Oct 2024

Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

Computer Vision and Pattern Recognition (CVPR), 2024

Yaliang Li

338

08 Aug 2024

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li

Yuanhan Zhang

Dong Guo

Renrui Zhang

Feng Li

Hao Zhang

Kaichen Zhang

Yanwei Li

Ziwei Liu

Chunyuan Li

MLLM SyDa VLM

797

2,258

06 Aug 2024

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Feng Li

Renrui Zhang

Hao Zhang

Yuanhan Zhang

Bo Li

Wei Li

Zejun Ma

Chunyuan Li

MLLM VLM

516

534

10 Jul 2024

Tarsier: Recipes for Training and Evaluating Large Video Description Models

Jiawei Wang

Liping Yuan

Yuchen Zhang

351

135

30 Jun 2024

ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos

Yu-Chiang Frank Wang

438

27 Jun 2024

VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models

Yuxuan Wang

Yueqian Wang

Dongyan Zhao

Cihang Xie

Zilong Zheng

MLLM VLM

324

24 Jun 2024

NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative

International Conference on Learning Representations (ICLR), 2024

515

10 Jun 2024

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

Neural Information Processing Systems (NeurIPS), 2024

Lin Chen

Xilin Wei

Jinsong Li

Xiaoyi Dong

Pan Zhang

...

Li Yuan

Yu Qiao

Dahua Lin

Feng Zhao

Jiaqi Wang

440

383

06 Jun 2024

What matters when building vision-language models?

Neural Information Processing Systems (NeurIPS), 2024

489

320

03 May 2024

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

See Kiong Ng

368

324

25 Apr 2024

AutoAD III: The Prequel -- Back to the Pixels

446

22 Apr 2024

V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

468

18 Apr 2024

Streaming Dense Video Captioning

292

01 Apr 2024

VideoAgent: Long-form Video Understanding with Large Language Model as Agent

502

283

15 Mar 2024

TempCompass: Do Video LLMs Really Understand Videos?

Shicheng Li

Lei Li

518

272

01 Mar 2024

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

...

Hsin-Ying Lee

Ming-Hsuan Yang

491

387

29 Feb 2024

OSCaR: Object State Captioning and State Change Representation

682

27 Feb 2024

Video ReCap: Recursive Captioning of Hour-Long Videos

Gedas Bertasius

793

20 Feb 2024

Learning Object State Changes in Videos: An Open-World Perspective

391

19 Dec 2023

VILA: On Pre-training for Visual Language Models

Computer Vision and Pattern Recognition (CVPR), 2023

Song Han

787

778

12 Dec 2023

Describing Differences in Image Sets with Natural Language

Computer Vision and Pattern Recognition (CVPR), 2023

Ruiqi Zhong

526

05 Dec 2023

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Computer Vision and Pattern Recognition (CVPR), 2023

Shicheng Li

489

412

04 Dec 2023

VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models

European Conference on Computer Vision (ECCV), 2023

Shicheng Li

Lei Li

275

29 Nov 2023

Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

Conghui He

484

217

28 Nov 2023

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Computer Vision and Pattern Recognition (CVPR), 2023

Fuxiao Liu

...

Furong Huang

630

471

23 Oct 2023

AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description

IEEE International Conference on Computer Vision (ICCV), 2023

324

10 Oct 2023

Improved Baselines with Visual Instruction Tuning

Computer Vision and Pattern Recognition (CVPR), 2023

749

4,820

05 Oct 2023

Unified Coarse-to-Fine Alignment for Video-Text Retrieval

IEEE International Conference on Computer Vision (ICCV), 2023

Gedas Bertasius

469

18 Sep 2023

The Change You Want to See (Now in 3D)

Ragav Sachdeva

Andrew Zisserman

3DPC VGen

369

21 Aug 2023

EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

Neural Information Processing Systems (NeurIPS), 2023

K. Mangalam

Raiymbek Akshulakov

Jitendra Malik

497

594

17 Aug 2023

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Hang Zhang

Xin Li

Lidong Bing

MLLM

806

1,669

05 Jun 2023

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Neural Information Processing Systems (NeurIPS), 2023

Christopher D. Manning

Chelsea Finn

ALM

1.1K

8,135

29 May 2023

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Jiabo Ye

...

Ji Zhang

Jingren Zhou

1.2K

1,215

27 Apr 2023

A Review of Deep Learning for Video Captioning

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

...

Fatih Porikli

281

22 Apr 2023

Visual Instruction Tuning

Neural Information Processing Systems (NeurIPS), 2023

1.4K

8,828

17 Apr 2023

Verbs in Action: Improving verb understanding in video-language models

IEEE International Conference on Computer Vision (ICCV), 2023

547

13 Apr 2023

AutoAD: Movie Description in Context

Computer Vision and Pattern Recognition (CVPR), 2023

300

29 Mar 2023

Sigmoid Loss for Language Image Pre-Training

IEEE International Conference on Computer Vision (ICCV), 2023

2.6K

2,833

27 Mar 2023

GPT-4 Technical Report

...

5.3K

23,506

15 Mar 2023

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

Computer Vision and Pattern Recognition (CVPR), 2023

589

364

27 Feb 2023

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

International Conference on Machine Learning (ICML), 2023

Silvio Savarese

1.6K

7,623

30 Jan 2023

Make-A-Video: Text-to-Video Generation without Text-Video Data

International Conference on Learning Representations (ICLR), 2022

...

Devi Parikh

411

1,961

29 Sep 2022

Revealing Single Frame Bias for Video-and-Language Learning

Annual Meeting of the Association for Computational Linguistics (ACL), 2022

Jie Lei

Tamara L. Berg

Joey Tianyi Zhou

291

151

07 Jun 2022

Revisiting the "Video" in Video-Language Understanding

Computer Vision and Pattern Recognition (CVPR), 2022

S. Buch

Cristobal Eyzaguirre

Adrien Gaidon

Jiajun Wu

L. Fei-Fei

Juan Carlos Niebles

255

215

03 Jun 2022

NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions

Computer Vision and Pattern Recognition (CVPR), 2021

Junbin Xiao

Xindi Shang

Angela Yao

Tat-Seng Chua

507

817

18 May 2021