v1v2v3v4 (latest)

Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015

Bryan A. Plummer

Liwei Wang

Christopher M. Cervantes

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,325 papers shown

V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs

122

30 Sep 2025

MuSLR: Multimodal Symbolic Logical Reasoning

...

130

30 Sep 2025

Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding

141

30 Sep 2025

OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding

29 Sep 2025

ColLab: A Collaborative Spatial Progressive Data Engine for Referring Expression Comprehension and Generation

125

28 Sep 2025

Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning

243

26 Sep 2025

Deepfakes: we need to re-think the concept of "real" images

J. Keuper

Margret Keuper

139

26 Sep 2025

Long Story Short: Disentangling Compositionality and Long-Caption Understanding in VLMs

Israfel Salazar

Desmond Elliott

Yova Kementchedjhieva

CoGe VLM

230

23 Sep 2025

OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment

Teng Xiao

Zuchao Li

Lefei Zhang

187

23 Sep 2025

Speech-to-See: End-to-End Speech-Driven Open-Set Object Detection

20 Sep 2025

RACap: Relation-Aware Prompting for Lightweight Retrieval-Augmented Image Captioning

136

19 Sep 2025

MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation

128

18 Sep 2025

Efficient Multimodal Dataset Distillation via Generative Models

287

18 Sep 2025

MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook

...

127

17 Sep 2025

Evaluating Robustness of Vision-Language Models Under Noisy Conditions

Purushoth

Alireza

AAML

15 Sep 2025

Towards Understanding Visual Grounding in Visual Language Models

Georgios Pantazopoulos

Eda B. Özyiğit

ObjD

324

12 Sep 2025

Recurrence Meets Transformers for Universal Multimodal Retrieval

188

10 Sep 2025

Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding

174

08 Sep 2025

Integrating Spatial and Semantic Embeddings for Stereo Sound Event Localization in Videos

Davide Berghi

Philip J. B. Jackson

111

08 Sep 2025

Effectively obtaining acoustic, visual and textual data from videos

Jorge E. León

Miguel Carrasco

VGen

139

06 Sep 2025

Semantic-guided LoRA Parameters Generation

116

05 Sep 2025

Human Preference-Aligned Concept Customization Benchmark via Decomposed Evaluation

146

03 Sep 2025

EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions

31 Aug 2025

VoCap: Video Object Captioning and Segmentation from Any Prompt

261

29 Aug 2025

Sparse and Dense Retrievers Learn Better Together: Joint Sparse-Dense Optimization for Text-Image Retrieval

22 Aug 2025

RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution

126

22 Aug 2025

Towards Open World Detection: A Survey

Andrei-Stefan Bulzan

Cosmin Cernazanu-Glavan

ObjD VLM

220

22 Aug 2025

Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering

174

20 Aug 2025

Understanding Data Influence with Differential Approximation

283

20 Aug 2025

7Bench: a Comprehensive Benchmark for Layout-guided Text-to-image Models

111

18 Aug 2025

Region-Level Context-Aware Multimodal Understanding

165

17 Aug 2025

Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models

145

15 Aug 2025

JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics

294

14 Aug 2025

Bridging Modality Gaps in e-Commerce Products via Vision-Language Alignment

192

13 Aug 2025

DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding

12 Aug 2025

ExpVG: Investigating the Design Space of Visual Grounding in Multimodal Large Language Model

126

11 Aug 2025

MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark

214

10 Aug 2025

SIFThinker: Spatially-Aware Image Focus for Visual Reasoning

287

08 Aug 2025

Adapting Vision-Language Models Without Labels: A Comprehensive Survey

219

07 Aug 2025

Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval

115

06 Aug 2025

ChartCap: Mitigating Hallucination of Dense Chart Captioning

Junyoung Lim

Jaewoo Ahn

Gunhee Kim

128

05 Aug 2025

VITRIX-CLIPIN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions

397

04 Aug 2025

Context-Adaptive Multi-Prompt Embedding with Large Language Models for Vision-Language Alignment

Dahun Kim

A. Angelova

VLM

232

03 Aug 2025

Eigen Neural Network: Unlocking Generalizable Vision with Eigenbasis

225

02 Aug 2025

Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models

142

01 Aug 2025

SPRINT: Scalable and Predictive Intent Refinement for LLM-Enhanced Session-based Recommendation

Dong Wang

SeongKu Kang

191

01 Aug 2025

Improving Multimodal Contrastive Learning of Sentence Embeddings with Object-Phrase Alignment

Kaiyan Zhao

Zhongtao Miao

Yoshimasa Tsuruoka

102

01 Aug 2025

Multimodal Referring Segmentation: A Survey

394

01 Aug 2025

Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval

384

31 Jul 2025

On the Reliability of Vision-Language Models Under Adversarial Frequency-Domain Perturbations

210

30 Jul 2025