SPICE: Semantic Propositional Image Caption Evaluation

29 July 2016

Papers citing "SPICE: Semantic Propositional Image Caption Evaluation"

50 / 1,002 papers shown

Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

847

21 Feb 2025

Natural Language Generation from Visual Events: State-of-the-Art and Key Open Questions

1.1K

18 Feb 2025

MindLLM: A Subject-Agnostic and Versatile Model for fMRI-to-Text Decoding

397

18 Feb 2025

Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Mohammad Mahdi Abootorabi

Amirhosein Zobeiri

Mahdi Dehghani

Mohammadali Mohammadkhani

709

12 Feb 2025

Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning

275

08 Feb 2025

A Video-grounded Dialogue Dataset and Metric for Event-driven Activities

AAAI Conference on Artificial Intelligence (AAAI), 2025

Wiradee Imrattanatrai

279

30 Jan 2025

Mobile Manipulation Instruction Generation from Multiple Images with Automatic Metric Enhancement

IEEE Robotics and Automation Letters (IEEE RA-L), 2025

420

28 Jan 2025

An Ensemble Model with Attention Based Mechanism for Image Captioning

Computers & Electrical Engineering (Comput. Electr. Eng.), 2025

Israa Al Badarneh

Bassam Hammo

Omar Al-Kadi

367

28 Jan 2025

DriveLM: Driving with Graph Visual Question Answering

European Conference on Computer Vision (ECCV), 2023

Chonghao Sima

Katrin Renz

Kashyap Chitta

Lawrence Yunliang Chen

802

348

17 Jan 2025

Classifier-Guided Captioning Across Modalities

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025

219

03 Jan 2025

Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning

European Conference on Computer Vision (ECCV), 2024

277

03 Jan 2025

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Computer Vision and Pattern Recognition (CVPR), 2024

...

412

31 Dec 2024

ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers

165

31 Dec 2024

Multi-Agent Planning Using Visual Language Models

European Conference on Artificial Intelligence (ECAI), 2024

260

31 Dec 2024

From Hallucinations to Facts: Enhancing Language Models with Curated Knowledge Graphs

228

24 Dec 2024

SCBench: A Sports Commentary Benchmark for Video LLMs

Kuangzhi Ge

Lawrence Yunliang Chen

230

23 Dec 2024

Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track

D. Gupta

Dina Demner-Fushman

LM&MA

238

15 Dec 2024

Learning to Merge Tokens via Decoupled Embedding for Efficient Vision Transformers

Neural Information Processing Systems (NeurIPS), 2024

Dong Hoon Lee

Seunghoon Hong

230

13 Dec 2024

Neptune: The Long Orbit to Benchmarking Long Video Understanding

...

442

12 Dec 2024

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

...

284

12 Dec 2024

CEGI: Measuring the trade-off between efficiency and carbon emissions for SLMs and VLMs

256

03 Dec 2024

DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding

302

02 Dec 2024

Detailed Object Description with Controllable Dimensions

IEEE Transactions on Multimedia (IEEE TMM), 2024

346

28 Nov 2024

VideoOrion: Tokenizing Object Dynamics in Videos

Sipeng Zheng

Zongqing Lu

395

25 Nov 2024

EVQAScore: A Fine-grained Metric for Video Question Answering Data Quality Evaluation

Hao Liang

Zirong Chen

Feiyu Xiong

Wentao Zhang

309

11 Nov 2024

ViTOC: Vision Transformer and Object-aware Captioner

Feiyang Huang

391

09 Nov 2024

Analyzing The Language of Visual Tokens

105

07 Nov 2024

Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

232

04 Nov 2024

TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models

Georgia Gabriela Sampaio

266

02 Nov 2024

MACE: Leveraging Audio for Evaluating Audio Captioning Systems

Satvik Dixit

Soham Deshmukh

Bhiksha Raj

249

01 Nov 2024

Preserving Pre-trained Representation Space: On Effectiveness of Prefix-tuning for Large Multi-modal Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

278

29 Oct 2024

Sensor2Text: Enabling Natural Language Interactions for Daily Activity Tracking Using Wearable Sensors

Proceedings of the ACM on Interactive Mobile Wearable and Ubiquitous Technologies (IMWUT), 2024

267

26 Oct 2024

SceneGraMMi: Scene Graph-boosted Hybrid-fusion for Multi-Modal Misinformation Veracity Prediction

Ponnurangam Kumaraguru

178

20 Oct 2024

EVA: An Embodied World Model for Future Video Anticipation

...

229

20 Oct 2024

Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image

Neural Information Processing Systems (NeurIPS), 2024

263

20 Oct 2024

EmotionCaps: Enhancing Audio Captioning Through Emotion-Augmented Data Generation

Mithun Manivannan

Vignesh Nethrapalli

Mark Cartwright

160

15 Oct 2024

SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing

ACM Transactions on Graphics (TOG), 2024

315

15 Oct 2024

Efficient and Effective Universal Adversarial Attack against Vision-Language Pre-training Models

Yang Liu

272

15 Oct 2024

Enhancing Robustness in Deep Reinforcement Learning: A Lyapunov Exponent Approach

Neural Information Processing Systems (NeurIPS), 2024

Rory Young

Nicolas Pugeault

AAML

359

14 Oct 2024

SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

Ziyang Ma

Kai Yu

269

12 Oct 2024

DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

Ziyang Ma

304

12 Oct 2024

Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

226

11 Oct 2024

A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks

Neural Information Processing Systems (NeurIPS), 2024

199

10 Oct 2024

Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

International Journal of Computer Vision (IJCV), 2024

287

09 Oct 2024

NaVIP: An Image-Centric Indoor Navigation Solution for Visually Impaired People

298

08 Oct 2024

The Mystery of Compositional Generalization in Graph-based Generative Commonsense Reasoning

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Xiyan Fu

Anette Frank

LRM

443

08 Oct 2024

An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment

201

08 Oct 2024

R-Bench: Are your Large Multimodal Model Robust to Real-world Corruptions?

Chunyi Li

Junxuan Zhang

Zicheng Zhang

H. Wu

Yuan Tian

...

Guo Lu

Xiaohong Liu

Xiongkuo Min

Weisi Lin

Guangtao Zhai

AAML

175

07 Oct 2024

CoVLM: Leveraging Consensus from Vision-Language Models for Semi-supervised Multi-modal Fake News Detection

Asian Conference on Computer Vision (ACCV), 2024

Devank

Jayateja Kalla

Soma Biswas

178

06 Oct 2024

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

International Conference on Learning Representations (ICLR), 2024

Christopher D. Manning

3DV

649

04 Oct 2024