Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015

Bryan A. Plummer

Liwei Wang

Christopher M. Cervantes

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,325 papers shown

An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding

European Conference on Computer Vision (ECCV), 2024

Wei Chen

Mahdieh Hatamian

Yu Wu

241

02 Aug 2024

MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

Luke Zettlemoyer

279

31 Jul 2024

FiCo-ITR: bridging fine-grained and coarse-grained image-text retrieval for comparative performance analysis

Mikel Williams-Lekuona

Georgina Cosma

230

29 Jul 2024

Start from Video-Music Retrieval: An Inter-Intra Modal Loss for Cross Modal Retrieval

226

28 Jul 2024

Unified Lexical Representation for Interpretable Visual-Language Alignment

211

25 Jul 2024

Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight

Jingdong Chen

Ming Yang

LRM

226

22 Jul 2024

Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection

251

21 Jul 2024

Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models

Md Zarif Hossain

Ahmed Imteaj

VLM AAML

189

20 Jul 2024

Learning Visual Grounding from Generative Vision and Language Model

Shijie Wang

286

18 Jul 2024

X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs

268

18 Jul 2024

Nearest Neighbor Future Captioning: Generating Descriptions for Possible Collisions in Object Placement Tasks

243

18 Jul 2024

OVGNet: A Unified Visual-Linguistic Framework for Open-Vocabulary Robotic Grasping

309

18 Jul 2024

Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning

254

17 Jul 2024

Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval

279

17 Jul 2024

Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights

310

16 Jul 2024

Towards Adversarially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques

273

15 Jul 2024

OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer

Errui Ding

Jingdong Wang

101

15 Jul 2024

Position: Measure Dataset Diversity, Don't Just Claim It

Dora Zhao

Jerone T. A. Andrews

Orestis Papakyriakopoulos

Alice Xiang

275

11 Jul 2024

IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model

Jie Wu

206

10 Jul 2024

How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?

Yuxin Chen

Chunfeng Yuan

Ying Shan

154

10 Jul 2024

Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation

Seonghoon Yu

Paul Hongsuck Seo

Jeany Son

DiffM

417

10 Jul 2024

A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends

Xiaoye Qu

Wei Hu

344

10 Jul 2024

LEMoN: Label Error Detection using Multimodal Neighbors

404

10 Jul 2024

Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions

Pavan Kumar Anasosalu Vasu

383

09 Jul 2024

A Single Transformer for Scalable Vision-Language Modeling

294

08 Jul 2024

OneDiff: A Generalist Model for Image Difference Captioning

529

08 Jul 2024

MobileFlow: A Multimodal LLM For Mobile GUI Agent

161

05 Jul 2024

ACTRESS: Active Retraining for Semi-supervised Visual Grounding

Weitai Kang

Mengxue Qu

Yunchao Wei

Yan Yan

326

03 Jul 2024

Visual Grounding with Attention-Driven Constraint Balancing

Weitai Kang

290

03 Jul 2024

SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding

Weitai Kang

Gaowen Liu

Mubarak Shah

Yan Yan

ObjD

419

03 Jul 2024

Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective

312

03 Jul 2024

MMedAgent: Learning to Use Medical Tools with Multi-modal Agent

Yixin Wang

225

02 Jul 2024

Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

Dinesh Manocha

412

01 Jul 2024

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Guanting Dong

...

Chen Li

291

164

01 Jul 2024

Tarsier: Recipes for Training and Evaluating Large Video Description Models

Jiawei Wang

Liping Yuan

Yuchen Zhang

304

114

30 Jun 2024

From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models

334

28 Jun 2024

Analyzing Quality, Bias, and Performance in Text-to-Image Generative Models

Nila Masrourisaadat

Nazanin Sedaghatkish

Fatemeh Sarshartehrani

Edward A. Fox

347

28 Jun 2024

Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language

Yicheng Chen

Xiangtai Li

Yining Li

Kai Chen

427

28 Jun 2024

A look under the hood of the Interactive Deep Learning Enterprise (No-IDLE)

278

27 Jun 2024

ScanFormer: Referring Expression Comprehension by Iteratively Scanning

278

26 Jun 2024

Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP

Sedigheh Eslami

Gerard de Melo

VLM

322

25 Jun 2024

Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models

255

24 Jun 2024

Review of Zero-Shot and Few-Shot AI Algorithms in The Medical Domain

Maged Badawi

Mohammedyahia Abushanab

Sheethal Bhat

Andreas Maier

VLM

271

23 Jun 2024

Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models?

Gregor Geigle

Radu Timofte

Goran Glavaš

260

20 Jun 2024

Revealing Vision-Language Integration in the Brain with Multimodal Networks

Boris Katz

247

20 Jun 2024

VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model

Jie Zhang

Sibo Wang

Xiangkui Cao

Zheng Yuan

Shiguang Shan

Xilin Chen

Wen Gao

VLM

377

20 Jun 2024

Composing Object Relations and Attributes for Image-Text Matching

Abhinav Shrivastava

267

17 Jun 2024

WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences

Yujie Lu

Dongfu Jiang

Wenhu Chen

William Yang Wang

Yejin Choi

Bill Yuchen Lin

VLM

442

16 Jun 2024

First Multi-Dimensional Evaluation of Flowchart Comprehension for Multimodal Large Language Models

261

14 Jun 2024

Explore the Limits of Omni-modal Pretraining at Scale

Handong Li

251

13 Jun 2024