Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015

Bryan A. Plummer

Liwei Wang

Christopher M. Cervantes

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,326 papers shown

I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

400

27 Feb 2025

Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP

296

26 Feb 2025

OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

...

428

25 Feb 2025

VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning

International Conference on Learning Representations (ICLR), 2025

408

25 Feb 2025

RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction Robustness

678

24 Feb 2025

SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding

457

24 Feb 2025

LOVA3: Learning to Visual Question Answering, Asking and Assessment

Neural Information Processing Systems (NeurIPS), 2024

454

21 Feb 2025

Enhancing Adversarial Robustness of Vision-Language Models through Low-Rank Adaptation

International Conference on Multimedia Retrieval (ICMR), 2024

433

21 Feb 2025

ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval

1.2K

21 Feb 2025

InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback

596

20 Feb 2025

Contrastive Localized Language-Image Pre-Training

473

20 Feb 2025

Megrez-Omni Technical Report

...

261

19 Feb 2025

HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

396

17 Feb 2025

How Blind and Low-Vision Individuals Prefer Large Vision-Language Model-Generated Scene Descriptions

337

15 Feb 2025

Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach

351

10 Feb 2025

Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion

International Conference on Learning Representations (ICLR), 2025

561

06 Feb 2025

Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models

539

03 Feb 2025

LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models

Computer Vision and Pattern Recognition (CVPR), 2025

564

31 Jan 2025

Fine Tuning without Catastrophic Forgetting via Selective Low Rank Adaptation

295

28 Jan 2025

Grounding Text-to-Image Diffusion Models for Controlled High-Quality Image Generation

Ahmad Süleyman

Göksel Biricik

481

15 Jan 2025

OneLLM: One Framework to Align All Modalities with Language

Computer Vision and Pattern Recognition (CVPR), 2023

692

222

10 Jan 2025

Classifier-Guided Captioning Across Modalities

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025

254

03 Jan 2025

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

International Conference on Learning Representations (ICLR), 2024

718

142

03 Jan 2025

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Neural Information Processing Systems (NeurIPS), 2024

...

998

141

03 Jan 2025

Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning

European Conference on Computer Vision (ECCV), 2024

341

03 Jan 2025

Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension

AAAI Conference on Artificial Intelligence (AAAI), 2025

339

03 Jan 2025

ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers

211

31 Dec 2024

Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

Neural Information Processing Systems (NeurIPS), 2024

639

31 Dec 2024

Towards Visual Grounding: A Survey

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

1.1K

28 Dec 2024

To Predict or Not To Predict? Proportionally Masked Autoencoders for Tabular Data Imputation

Jungkyu Kim

Kibok Lee

Taeyoung Park

382

26 Dec 2024

GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

Meishan Zhang

610

108

22 Dec 2024

DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment

Computer Vision and Pattern Recognition (CVPR), 2024

...

378

20 Dec 2024

Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data

445

19 Dec 2024

I0T: Embedding Standardization Method Towards Zero Modality Gap

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

368

18 Dec 2024

LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer

...

410

18 Dec 2024

M$^3$-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation

Computer Vision and Pattern Recognition (CVPR), 2024

465

18 Dec 2024

FLAIR: VLM with Fine-grained Language-informed Image Representations

Computer Vision and Pattern Recognition (CVPR), 2024

Rui Xiao

Sanghwan Kim

Mariana-Iuliana Georgescu

Zeynep Akata

Stephan Alaniz

VLM CLIP

350

04 Dec 2024

DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding

336

02 Dec 2024

CIA: Controllable Image Augmentation Framework Based on Stable Diffusion

Conference on Multimedia Information Processing and Retrieval (MIPR), 2024

Mohamed Benkedadra

Dany Rimez

Tiffanie Godelaine

Natarajan Chidambaram

Hamed Razavi Khosroshahi

382

25 Nov 2024

IterIS: Iterative Inference-Solving Alignment for LoRA Merging

Computer Vision and Pattern Recognition (CVPR), 2024

490

21 Nov 2024

AI-generated Image Detection: Passive or Watermark?

578

20 Nov 2024

Joint Vision-Language Social Bias Removal for CLIP

Computer Vision and Pattern Recognition (CVPR), 2024

457

19 Nov 2024

SoK: The Security-Safety Continuum of Multimodal Foundation Models through Information Flow and Global Game-Theoretic Analysis of Asymmetric Threats

777

17 Nov 2024

Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations

Zacharie Delpierre Coudert

Kartikeya Upasani

Mahesh Pasupuleti

MLLM 3DH

288

102

15 Nov 2024

Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models

231

14 Nov 2024

AD-DINO: Attention-Dynamic DINO for Distance-Aware Embodied Reference Understanding

297

13 Nov 2024

No Culture Left Behind: ArtELingo-28, a Benchmark of WikiArt with Captions in 28 Languages

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

250

06 Nov 2024

HumanVLM: Foundation for Human-Scene Vision-Language Model

Information Fusion (Inf. Fusion), 2024

426

05 Nov 2024

Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

276

04 Nov 2024

TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

Neural Information Processing Systems (NeurIPS), 2024

380

04 Nov 2024