SPICE: Semantic Propositional Image Caption Evaluation

29 July 2016

Papers citing "SPICE: Semantic Propositional Image Caption Evaluation"

Showing 50 of 1,002 papers

GiT: Towards Generalist Vision Transformer through Universal Language Interface

European Conference on Computer Vision (ECCV), 2024

Muhammad Ferjad Naeem

Jiaming Song

Bernt Schiele

Liwei Wang

VLM

279

14 Mar 2024

A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes

222

12 Mar 2024

MeaCap: Memory-Augmented Zero-shot Image Captioning

304

06 Mar 2024

Neural Image Compression with Text-guided Encoding for both Pixel-level and Perceptual Fidelity

231

05 Mar 2024

DECIDER: A Dual-System Rule-Controllable Decoding Framework for Language Generation

...

392

04 Mar 2024

Polos: Multimodal Metric Learning from Human Feedback for Image Captioning

212

28 Feb 2024

Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction

222

28 Feb 2024

EDTC: enhance depth of text comprehension in automated audio captioning

Liwen Tan

Yin Cao

Yi Zhou

207

27 Feb 2024

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

376

26 Feb 2024

AVI-Talking: Learning Audio-Visual Instructions for Expressive 3D Talking Face Generation

156

25 Feb 2024

TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages

331

25 Feb 2024

Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP Guided Reinforcement Learning

Antoine Chaffin

Ewa Kijak

Vincent Claveau

260

21 Feb 2024

MORE: Multi-mOdal REtrieval Augmented Generative Commonsense Reasoning

337

21 Feb 2024

SInViG: A Self-Evolving Interactive Visual Agent for Human-Robot Interaction

282

19 Feb 2024

Cobra Effect in Reference-Free Image Captioning Metrics

242

18 Feb 2024

ProtChatGPT: Towards Understanding Proteins with Large Language Models

Chao Wang

Hehe Fan

Ruijie Quan

Yi Yang

232

15 Feb 2024

A Systematic Review of Data-to-Text NLG

Chinonso Osuji

Thiago Castro Ferreira

Brian Davis

328

13 Feb 2024

MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning

Interspeech (Interspeech), 2024

Yifei Xin

287

12 Feb 2024

Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy

International Conference on Learning Representations (ICLR), 2024

Simon Ging

M. A. Bravo

Thomas Brox

VLM

401

11 Feb 2024

CIC: A Framework for Culturally-Aware Image Captioning

Youngsik Yun

Jihie Kim

VLM

413

08 Feb 2024

Multimodal Rationales for Explainable Visual Question Answering

Kun Li

G. Vosselman

Michael Ying Yang

504

06 Feb 2024

SymbolicAI: A framework for logic-based approaches combining generative models and solvers

Marius-Constantin Dinu

Claudiu Leoveanu-Condrei

Markus Holzleitner

Werner Zellinger

Sepp Hochreiter

319

01 Feb 2024

SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling

Eileen Wang

S. Han

Josiah Poon

278

01 Feb 2024

Common Sense Reasoning for Deepfake Detection

481

31 Jan 2024

EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning

Jaeyeon Kim

Jaeyoon Jung

Jinjoo Lee

Sang Hoon Woo

CLIP VLM

203

31 Jan 2024

Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data

Chenhui Zhang

Sherrie Wang

283

31 Jan 2024

A Survey on Data Augmentation in Large Model Era

485

27 Jan 2024

Zero Shot Open-ended Video Inference

146

23 Jan 2024

Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data

International Conference on Learning Representations (ICLR), 2024

Yuhui Zhang

Elaine Sui

Serena Yeung-Levy

198

16 Jan 2024

Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training

212

04 Jan 2024

Video Understanding with Large Language Models: A Survey

...

717

167

29 Dec 2023

Towards Consistent Language Models Using Declarative Constraints

Jasmin Mousavi

Arash Termehchy

HILM ALM

203

24 Dec 2023

Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning

AAAI Conference on Artificial Intelligence (AAAI), 2023

254

14 Dec 2023

ToViLaG: Your Visual-Language Generative Model is Also An Evildoer

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Xing Xie

251

13 Dec 2023

OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization

Dongchen Han

Yang Liu

Yang Bai

Jindong Gu

Yang Liu

Simeng Qin

VLM

278

07 Dec 2023

Towards Knowledge-driven Autonomous Driving

Licheng Wen

...

Yu Qiao

413

07 Dec 2023

Mitigating Open-Vocabulary Caption Hallucinations

395

06 Dec 2023

Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment

European Conference on Computer Vision (ECCV), 2023

Daniel Cohen-Or

240

05 Dec 2023

Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning

IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2023

Cong Yang

Zuchao Li

Lefei Zhang

163

02 Dec 2023

Segment and Caption Anything

Computer Vision and Pattern Recognition (CVPR), 2023

Zicheng Liu

244

01 Dec 2023

MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning

Computer Vision and Pattern Recognition (CVPR), 2023

Zicheng Liu

250

29 Nov 2023

StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023

Kazuki Yamauchi

Yusuke Ijima

Yuki Saito

178

28 Nov 2023

DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism

European Conference on Computer Vision (ECCV), 2023

237

25 Nov 2023

From Wrong To Right: A Recursive Approach Towards Vision-Language Explanation

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Boyi Li

252

21 Nov 2023

InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models

...

Jianbo Yuan

Heng Wang

Hongxia Yang

ReLM LRM ELM

422

20 Nov 2023

Trustworthy Large Models in Vision: A Survey

Ziyan Guo

Kepeng Xu

Jun Liu

653

16 Nov 2023

Zero-shot audio captioning with audio-language model guidance and audio context keywords

Leonard Salewski

Stefan Fauth

A. Sophia Koepke

Zeynep Akata

202

14 Nov 2023

Improving Image Captioning via Predicting Structured Concepts

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Ting Wang

Weidong Chen

Yuanhe Tian

Yan Song

Zhendong Mao

221

14 Nov 2023

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu

Jin Xu

Xiaohuan Zhou

Qian Yang

Shiliang Zhang

Zhijie Yan

Chang Zhou

Jingren Zhou

AuLLM

320

595

14 Nov 2023

Zero-shot Translation of Attention Patterns in VQA Models to Natural Language

Leonard Salewski

A. Sophia Koepke

Hendrik P. A. Lensch

Zeynep Akata

210

08 Nov 2023