arXiv:2404.12353, v3 (latest)
V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning
18 April 2024
Hang Hua
Yunlong Tang
Chenliang Xu
Jiebo Luo
VGen
Papers citing "V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning"
27 / 27 papers shown
Latent Chain-of-Thought for Visual Reasoning
Guohao Sun
Hang Hua
Jian Wang
Jiebo Luo
S. Dianat
Majid Rabbani
Raghuveer Rao
Zhiqiang Tao
BDL
OffRL
LRM
187
4
0
27 Oct 2025
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Yunlong Tang
Jing Bi
Pinxin Liu
Zhenyu Pan
Mingqian Feng
...
Zeliang Zhang
Daiki Shimada
Han Liu
Jiebo Luo
Chenliang Xu
MLLM
OffRL
VLM
LRM
486
7
0
06 Oct 2025
FrameOracle: Learning What to See and How Much to See in Videos
Chaoyu Li
Tianzhi Li
Fei Tao
Zhenyu Zhao
Ziqian Wu
Maozheng Zhao
Juntong Song
Cheng Niu
Pooyan Fazli
VLM
76
0
0
04 Oct 2025
Trustworthy Summarization via Uncertainty Quantification and Risk Awareness in Large Language Models
Shuaidong Pan
Di Wu
HILM
114
6
0
23 Sep 2025
Harnessing Object Grounding for Time-Sensitive Video Understanding
Tz-Ying Wu
S. N. Sridhar
Subarna Tripathi
81
0
0
08 Sep 2025
Spiking Variational Graph Representation Inference for Video Summarization
IEEE Transactions on Image Processing (IEEE TIP), 2025
Wenrui Li
Wei Han
Liang-Jian Deng
Ruiqin Xiong
Xiaopeng Fan
64
3
0
21 Aug 2025
Mitigating Behavioral Hallucination in Multimodal Large Language Models for Sequential Images
Liangliang You
Junchi Yao
Shu Yang
Guimin Hu
Lijie Hu
Di Wang
MLLM
183
2
0
08 Jun 2025
Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation
Galann Pennec
Zhengyuan Liu
Nicholas Asher
Philippe Muller
Nancy F. Chen
VGen
359
0
0
10 May 2025
HierSum: A Global and Local Attention Mechanism for Video Summarization
Apoorva Beedu
Irfan Essa
781
0
0
25 Apr 2025
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
Yunlong Tang
Jing Bi
Chao Huang
Susan Liang
Daiki Shimada
...
Jinxi He
Liu He
Zeliang Zhang
Jiebo Luo
Chenliang Xu
194
8
0
07 Apr 2025
WikiVideo: Article Generation from Multiple Videos
Alexander Martin
Reno Kriz
William Walden
Kate Sanders
Hannah Recknor
Eugene Yang
Francis Ferraro
Benjamin Van Durme
DiffM
VGen
342
3
0
01 Apr 2025
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
Jing Bi
Junjia Guo
Susan Liang
Guangyu Sun
Luchuan Song
...
Jinxi He
Jiarui Wu
Ali Vosoughi
Chong Chen
Chenliang Xu
LRM
186
15
0
14 Mar 2025
UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Baining Zhao
Jianjie Fang
Zichao Dai
Liang Luo
Jirong Zha
...
Chen Gao
Yijiao Wang
Jinqiang Cui
Xinlei Chen
Yongqian Li
269
19
0
08 Mar 2025
Do Language Models Understand Time?
The Web Conference (WWW), 2025
Xi Ding
Lei Wang
644
9
0
18 Dec 2024
Progress-Aware Video Frame Captioning
Computer Vision and Pattern Recognition (CVPR), 2025
Zihui Xue
Joungbin An
Xitong Yang
Kristen Grauman
512
5
0
03 Dec 2024
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
Computer Vision and Pattern Recognition (CVPR), 2025
Hang Hua
Qing Liu
Lingzhi Zhang
Jing Shi
Zhifei Zhang
Yilin Wang
Jianming Zhang
Jiebo Luo
CoGe
VLM
272
16
0
23 Nov 2024
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
Computer Vision and Pattern Recognition (CVPR), 2025
Yunlong Tang
Junjia Guo
Hang Hua
Susan Liang
Mingqian Feng
...
Chao Huang
Jing Bi
Zeliang Zhang
Pooyan Fazli
Chenliang Xu
CoGe
332
14
0
17 Nov 2024
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
Hang Hua
Yunlong Tang
Ziyun Zeng
Liangliang Cao
Zhengyuan Yang
Hangfeng He
Chenliang Xu
Jiebo Luo
VLM
CoGe
164
21
0
13 Oct 2024
TRACE: Temporal Grounding Video LLM via Causal Event Modeling
International Conference on Learning Representations (ICLR), 2025
Yongxin Guo
Jingyu Liu
Mingda Li
Xiaoying Tang
Qingbin Liu
221
43
0
08 Oct 2024
EAGLE: Egocentric AGgregated Language-video Engine
ACM Multimedia (MM), 2024
Jing Bi
Yunlong Tang
Luchuan Song
Ali Vosoughi
Nguyen Nguyen
Chenliang Xu
178
15
0
26 Sep 2024
FastTalker: Jointly Generating Speech and Conversational Gestures from Text
Zixin Guo
Jian Zhang
327
3
0
24 Sep 2024
CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion
AAAI Conference on Artificial Intelligence (AAAI), 2025
Yunlong Tang
Gen Zhan
Li Yang
Yiting Liao
Chenliang Xu
VGen
DiffM
LRM
276
13
0
21 Aug 2024
Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation
Liu He
Yizhi Song
Hejun Huang
Pinxin Liu
Yunlong Tang
Daniel G. Aliaga
Xin Zhou
DiffM
VGen
362
8
0
19 Aug 2024
PromptFix: You Prompt and We Fix the Photo
Yongsheng Yu
Ziyun Zeng
Hang Hua
Jianlong Fu
Jiebo Luo
MLLM
DiffM
VLM
168
37
0
27 May 2024
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Hang Hua
Jing Shi
Kushal Kafle
Simon Jenni
Daoan Zhang
John Collomosse
Scott D. Cohen
Jiebo Luo
CoGe
VLM
174
14
0
23 Apr 2024
Tri²-plane: Thinking Head Avatar via Feature Pyramid
European Conference on Computer Vision (ECCV), 2024
Luchuan Song
Pinxin Liu
Lele Chen
Guojun Yin
Chenliang Xu
3DH
228
14
0
17 Jan 2024
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Jiebo Luo
Chenliang Xu
VLM
571
152
0
29 Dec 2023