2304.04227
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
9 April 2023
Jun Chen
Deyao Zhu
Kilichbek Haydarov
Xiang Li
Mohamed Elhoseiny
Papers citing
"Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions"
32 / 32 papers shown
Title
ImageSet2Text: Describing Sets of Images through Text
Piera Riccio
F. Galati
Kajetan Schweighofer
Noa Garcia
Nuria Oliver
VLM
CoGe
72
0
0
25 Mar 2025
Fine-Grained Video Captioning through Scene Graph Consolidation
Sanghyeok Chu
Seonguk Seo
Bohyung Han
48
1
0
23 Feb 2025
Character-aware audio-visual subtitling in context
Jaesung Huh
Andrew Zisserman
31
0
0
14 Oct 2024
G²TR: Generalized Grounded Temporal Reasoning for Robot Instruction Following by Combining Large Pre-trained Models
Riya Arora
N. N.
Aman Tambi
Sandeep S. Zachariah
Souvik Chakraborty
Rohan Paul
LM&Ro
26
0
0
10 Oct 2024
UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark
Hasnat Md Abdullah
Tian Liu
Kangda Wei
Shu Kong
Ruihong Huang
29
2
0
02 Oct 2024
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Shengsheng Qian
Zuyi Zhou
Dizhan Xue
Bing Wang
Changsheng Xu
LRM
34
1
0
19 Sep 2024
GradBias: Unveiling Word Influence on Bias in Text-to-Image Generative Models
Moreno D'Incà
E. Peruzzo
Massimiliano Mancini
Xingqian Xu
Humphrey Shi
N. Sebe
39
0
0
29 Aug 2024
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
Xiaowei Chi
Yatian Wang
Aosong Cheng
Pengjun Fang
Zeyue Tian
...
Wenhan Luo
Qifeng Chen
Shanghang Zhang
Qi-fei Liu
Yi-Ting Guo
67
7
0
30 Jul 2024
IWISDM: Assessing instruction following in multimodal models at scale
Xiaoxuan Lei
Lucas Gomez
Hao Yuan Bai
P. Bashivan
VLM
17
1
0
20 Jun 2024
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Lin Chen
Xilin Wei
Jinsong Li
Xiaoyi Dong
Pan Zhang
...
Li Yuan
Yu Qiao
Dahua Lin
Feng Zhao
Jiaqi Wang
72
138
0
06 Jun 2024
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Ling-Hao Chen
Shunlin Lu
Ailing Zeng
Hao Zhang
Benyou Wang
Ruimao Zhang
Lei Zhang
45
33
0
30 May 2024
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Ziyang Wang
Shoubin Yu
Elias Stengel-Eskin
Jaehong Yoon
Feng Cheng
Gedas Bertasius
Mohit Bansal
40
56
0
29 May 2024
HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision
Siddhant Bansal
Michael Wray
Dima Damen
31
3
0
15 Apr 2024
OpenBias: Open-set Bias Detection in Text-to-Image Generative Models
Moreno D'Incà
E. Peruzzo
Massimiliano Mancini
Dejia Xu
Vidit Goel
Xingqian Xu
Zhangyang Wang
Humphrey Shi
N. Sebe
53
31
0
11 Apr 2024
VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning
Alexandros Xenos
Niki Maria Foteinopoulou
Ioanna Ntinou
Ioannis Patras
Georgios Tzimiropoulos
19
15
0
10 Apr 2024
LongVLM: Efficient Long Video Understanding via Large Language Models
Yuetian Weng
Mingfei Han
Haoyu He
Xiaojun Chang
Bohan Zhuang
VLM
60
56
0
04 Apr 2024
Contextual AD Narration with Interleaved Multimodal Sequence
Hanlin Wang
Zhan Tong
Kecheng Zheng
Yujun Shen
Limin Wang
VGen
47
4
0
19 Mar 2024
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Ping Luo
Jiebo Luo
Chenliang Xu
VLM
50
81
0
29 Dec 2023
A Simple LLM Framework for Long-Range Video Question-Answering
Ce Zhang
Taixi Lu
Md. Mohaiminul Islam
Ziyang Wang
Shoubin Yu
Mohit Bansal
Gedas Bertasius
100
80
0
28 Dec 2023
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Jun Chen
Deyao Zhu
Xiaoqian Shen
Xiang Li
Zechun Liu
Pengchuan Zhang
Raghuraman Krishnamoorthi
Vikas Chandra
Yunyang Xiong
Mohamed Elhoseiny
MLLM
160
280
0
14 Oct 2023
Language as the Medium: Multimodal Video Classification through text only
Laura Hanu
A. Vero
James Thewlis
41
3
0
19 Sep 2023
Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations
Kilichbek Haydarov
Xiaoqian Shen
Avinash Madasu
Mahmoud Salem
Jia Li
Gamaleldin F. Elsayed
Mohamed Elhoseiny
28
4
0
30 Aug 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming Yang
F. Khan
VLM
18
116
0
25 Jul 2023
Enhance Reasoning Ability of Visual-Language Models via Large Language Models
Yueting Yang
Xintong Zhang
Wenjuan Han
VLM
ReLM
LRM
22
1
0
22 May 2023
Exploring Human-Like Translation Strategy with Large Language Models
Zhiwei He
Tian Liang
Wenxiang Jiao
Zhuosheng Zhang
Yujiu Yang
Rui Wang
Zhaopeng Tu
Shuming Shi
Xing Wang
19
39
0
06 May 2023
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System
Junke Wang
Dongdong Chen
Chong Luo
Xiyang Dai
Lu Yuan
Zuxuan Wu
Yu-Gang Jiang
93
54
0
27 Apr 2023
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu
Jun Chen
Xiaoqian Shen
Xiang Li
Mohamed Elhoseiny
VLM
MLLM
18
1,877
0
20 Apr 2023
LLM as A Robotic Brain: Unifying Egocentric Memory and Control
Jinjie Mai
Jun Chen
Bing Li
Guocheng Qian
Mohamed Elhoseiny
Bernard Ghanem
LM&Ro
10
33
0
19 Apr 2023
Visual Language Maps for Robot Navigation
Chen Huang
Oier Mees
Andy Zeng
Wolfram Burgard
LM&Ro
145
337
0
11 Oct 2022
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao
Jeffrey Zhao
Dian Yu
Nan Du
Izhak Shafran
Karthik Narasimhan
Yuan Cao
LLMAG
ReLM
LRM
233
2,413
0
06 Oct 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
S. Hoi
MLLM
BDL
VLM
CLIP
388
4,010
0
28 Jan 2022
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Yin Cui
Boqing Gong
ViT
240
573
0
22 Apr 2021