1607.08822
Cited By
SPICE: Semantic Propositional Image Caption Evaluation
29 July 2016
Peter Anderson
Basura Fernando
Mark Johnson
Stephen Gould
EGVM
Papers citing "SPICE: Semantic Propositional Image Caption Evaluation"
50 / 1,002 papers shown
GiT: Towards Generalist Vision Transformer through Universal Language Interface
European Conference on Computer Vision (ECCV), 2024
Haiyang Wang
Hao Tang
Li Jiang
Shaoshuai Shi
Muhammad Ferjad Naeem
Jiaming Song
Bernt Schiele
Liwei Wang
VLM
279
22
0
14 Mar 2024
A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes
Ting Yu
Xiaojun Lin
Shuhui Wang
Weiguo Sheng
Qingming Huang
Jun-chen Yu
3DV
222
17
0
12 Mar 2024
MeaCap: Memory-Augmented Zero-shot Image Captioning
Zequn Zeng
Yan Xie
Hao Zhang
Chiyu Chen
Zhengjue Wang
Boli Chen
VLM
304
46
0
06 Mar 2024
Neural Image Compression with Text-guided Encoding for both Pixel-level and Perceptual Fidelity
Hagyeong Lee
Minkyu Kim
Jun-Hyuk Kim
Seungeon Kim
Dokwan Oh
Jaeho Lee
DiffM
231
17
0
05 Mar 2024
DECIDER: A Dual-System Rule-Controllable Decoding Framework for Language Generation
Chen Xu
Tian Lan
Changlong Yu
Wei Wang
Jun Gao
...
Qunxi Dong
Kun Qian
Piji Li
Wei Bi
Bin Hu
392
2
0
04 Mar 2024
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning
Yuiga Wada
Kanta Kaneda
Daichi Saito
Komei Sugiura
212
47
0
28 Feb 2024
Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction
Koki Maeda
Shuhei Kurita
Taiki Miyanishi
Naoaki Okazaki
222
6
0
28 Feb 2024
EDTC: enhance depth of text comprehension in automated audio captioning
Liwen Tan
Yin Cao
Yi Zhou
207
0
0
27 Feb 2024
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
Yichi Zhang
Ziqiao Ma
Xiaofeng Gao
Suhaila Shakiah
Qiaozi Gao
Joyce Chai
MLLM
VLM
376
74
0
26 Feb 2024
AVI-Talking: Learning Audio-Visual Instructions for Expressive 3D Talking Face Generation
Yasheng Sun
Wenqing Chu
Hang Zhou
Kaisiyuan Wang
Hideki Koike
156
11
0
25 Feb 2024
TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages
Minsu Kim
Jee-weon Jung
Hyeongseop Rha
Soumi Maiti
Siddhant Arora
Xuankai Chang
Shinji Watanabe
Y. Ro
331
8
0
25 Feb 2024
Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP Guided Reinforcement Learning
Antoine Chaffin
Ewa Kijak
Vincent Claveau
260
3
0
21 Feb 2024
MORE: Multi-mOdal REtrieval Augmented Generative Commonsense Reasoning
Wanqing Cui
Keping Bi
Jiafeng Guo
Xueqi Cheng
SyDa
ReLM
RALM
LRM
337
14
0
21 Feb 2024
SInViG: A Self-Evolving Interactive Visual Agent for Human-Robot Interaction
Jie Xu
Hanbo Zhang
Xinghang Li
Huaping Liu
Xuguang Lan
Tao Kong
LM&Ro
282
5
0
19 Feb 2024
Cobra Effect in Reference-Free Image Captioning Metrics
Zheng Ma
Changxin Wang
Yawen Ouyang
Fei Zhao
Jianbing Zhang
Shujian Huang
Jiajun Chen
242
4
0
18 Feb 2024
ProtChatGPT: Towards Understanding Proteins with Large Language Models
Chao Wang
Hehe Fan
Ruijie Quan
Yi Yang
232
21
0
15 Feb 2024
A Systematic Review of Data-to-Text NLG
Chinonso Osuji
Thiago Castro Ferreira
Brian Davis
328
4
0
13 Feb 2024
MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning
Interspeech, 2024
Hang Zhao
Yifei Xin
Zhesong Yu
Bilei Zhu
Lu Lu
Zejun Ma
AuLLM
287
5
0
12 Feb 2024
Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy
International Conference on Learning Representations (ICLR), 2024
Simon Ging
M. A. Bravo
Thomas Brox
VLM
401
19
0
11 Feb 2024
CIC: A Framework for Culturally-Aware Image Captioning
Youngsik Yun
Jihie Kim
VLM
413
10
0
08 Feb 2024
Multimodal Rationales for Explainable Visual Question Answering
Kun Li
G. Vosselman
Michael Ying Yang
504
2
0
06 Feb 2024
SymbolicAI: A framework for logic-based approaches combining generative models and solvers
Marius-Constantin Dinu
Claudiu Leoveanu-Condrei
Markus Holzleitner
Werner Zellinger
Sepp Hochreiter
319
18
0
01 Feb 2024
SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling
Eileen Wang
S. Han
Josiah Poon
278
5
0
01 Feb 2024
Common Sense Reasoning for Deepfake Detection
Yue Zhang
Ben Colman
Xiao Guo
Ali Shahriyari
Gaurav Bharaj
481
59
0
31 Jan 2024
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
Jaeyeon Kim
Jaeyoon Jung
Jinjoo Lee
Sang Hoon Woo
CLIP
VLM
203
42
0
31 Jan 2024
Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data
Chenhui Zhang
Sherrie Wang
283
36
0
31 Jan 2024
A Survey on Data Augmentation in Large Model Era
Yue Zhou
Chenlu Guo
Xu Wang
Yi-Ju Chang
Yuan Wu
LM&MA
VLM
485
49
0
27 Jan 2024
Zero Shot Open-ended Video Inference
Ee Yeo Keat
Zhang Hao
Alexander Matyasko
Basura Fernando
VLM
146
0
0
23 Jan 2024
Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data
International Conference on Learning Representations (ICLR), 2024
Yuhui Zhang
Elaine Sui
Serena Yeung-Levy
198
17
0
16 Jan 2024
Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training
Longtian Qiu
Shan Ning
Xuming He
VLM
212
12
0
04 Jan 2024
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Chenliang Xu
Jiebo Luo
Chenliang Xu
VLM
717
167
0
29 Dec 2023
Towards Consistent Language Models Using Declarative Constraints
Jasmin Mousavi
Arash Termehchy
HILM
ALM
203
2
0
24 Dec 2023
Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning
AAAI Conference on Artificial Intelligence (AAAI), 2024
Zhiyue Liu
Jinyuan Liu
Fanrong Ma
CLIP
VLM
254
20
0
14 Dec 2023
ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Xinpeng Wang
Xiaoyuan Yi
Han Jiang
Shanlin Zhou
Zhihua Wei
Xing Xie
251
25
0
13 Dec 2023
OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization
Dongchen Han
Yang Liu
Yang Bai
Jindong Gu
Yang Liu
Simeng Qin
VLM
278
32
0
07 Dec 2023
Towards Knowledge-driven Autonomous Driving
Xin Li
Yeqi Bai
Pinlong Cai
Licheng Wen
Daocheng Fu
...
Yikang Li
Ding Wang
Yong-Jin Liu
Xiaoling Wang
Yu Qiao
413
36
0
07 Dec 2023
Mitigating Open-Vocabulary Caption Hallucinations
Assaf Ben-Kish
Moran Yanuka
Morris Alper
Raja Giryes
Hadar Averbuch-Elor
MLLM
VLM
395
14
0
06 Dec 2023
Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment
European Conference on Computer Vision (ECCV), 2024
Brian Gordon
Yonatan Bitton
Yonatan Shafir
Roopal Garg
Xi Chen
Dani Lischinski
Daniel Cohen-Or
Idan Szpektor
240
17
0
05 Dec 2023
Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning
IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2023
Cong Yang
Zuchao Li
Lefei Zhang
163
61
0
02 Dec 2023
Segment and Caption Anything
Computer Vision and Pattern Recognition (CVPR), 2024
Xiaoke Huang
Jianfeng Wang
Yansong Tang
Zheng Zhang
Han Hu
Jiwen Lu
Lijuan Wang
Zicheng Liu
MLLM
VLM
244
33
0
01 Dec 2023
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
Computer Vision and Pattern Recognition (CVPR), 2024
Chaoyi Zhang
Kevin Qinghong Lin
Zhengyuan Yang
Jianfeng Wang
Linjie Li
Chung-Ching Lin
Zicheng Liu
Lijuan Wang
VGen
250
49
0
29 Nov 2023
StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Kazuki Yamauchi
Yusuke Ijima
Yuki Saito
178
12
0
28 Nov 2023
DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism
European Conference on Computer Vision (ECCV), 2024
Zhen Wang
Xinyun Jiang
Jun Xiao
Tao Chen
Long Chen
DiffM
237
4
0
25 Nov 2023
From Wrong To Right: A Recursive Approach Towards Vision-Language Explanation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Jiaxin Ge
Sanjay Subramanian
Trevor Darrell
Boyi Li
LRM
252
4
0
21 Nov 2023
InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models
Xiaotian Han
Quanzeng You
Yongfei Liu
Wentao Chen
Huangjie Zheng
...
Yiqi Wang
Bohan Zhai
Jianbo Yuan
Heng Wang
Hongxia Yang
ReLM
LRM
ELM
422
11
0
20 Nov 2023
Trustworthy Large Models in Vision: A Survey
Ziyan Guo
Kepeng Xu
Jun Liu
MU
653
0
0
16 Nov 2023
Zero-shot audio captioning with audio-language model guidance and audio context keywords
Leonard Salewski
Stefan Fauth
A. Sophia Koepke
Zeynep Akata
202
15
0
14 Nov 2023
Improving Image Captioning via Predicting Structured Concepts
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Ting Wang
Weidong Chen
Yuanhe Tian
Yan Song
Zhendong Mao
221
11
0
14 Nov 2023
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Yunfei Chu
Jin Xu
Xiaohuan Zhou
Qian Yang
Shiliang Zhang
Zhijie Yan
Chang Zhou
Jingren Zhou
AuLLM
320
595
0
14 Nov 2023
Zero-shot Translation of Attention Patterns in VQA Models to Natural Language
Leonard Salewski
A. Sophia Koepke
Hendrik P. A. Lensch
Zeynep Akata
210
4
0
08 Nov 2023