
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik
arXiv: 1505.04870 (abs / PDF / HTML)

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,325 papers shown
Fact: Teaching MLLMs with Faithful, Concise and Transferable Rationales
Minghe Gao
Shuang Chen
Liang Pang
Xingtai Lv
Jisheng Dang
Wenqiao Zhang
Juncheng Li
Siliang Tang
Yueting Zhuang
Tat-Seng Chua
LRM
161
10
0
17 Apr 2024
HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision
Siddhant Bansal
Michael Wray
Dima Damen
219
10
0
15 Apr 2024
RankCLIP: Ranking-Consistent Language-Image Pretraining
Yiming Zhang
Zhuokai Zhao
Zhaorun Chen
Zhili Feng
Zenghui Ding
Yining Sun
SSL, VLM
405
9
0
15 Apr 2024
Generalized Contrastive Learning for Multi-Modal Retrieval and Ranking
Tianyu Zhu
M. Jung
Jesse Clark
430
4
0
12 Apr 2024
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Haotian Zhang
Haoxuan You
Philipp Dufter
Bowen Zhang
Chen Chen
...
Tsu-Jui Fu
William Y. Wang
Shih-Fu Chang
Zhe Gan
Yinfei Yang
ObjD, MLLM
285
84
0
11 Apr 2024
How is Visual Attention Influenced by Text Guidance? Database and Model
Yinan Sun
Xiongkuo Min
Huiyu Duan
Guangtao Zhai
260
7
0
11 Apr 2024
To Cool or not to Cool? Temperature Network Meets Large Foundation Models via DRO
Zi-Hao Qiu
Siqi Guo
Mao Xu
Tuo Zhao
Lijun Zhang
Tianbao Yang
AI4TS, AI4CE
288
10
0
06 Apr 2024
Vision Transformers in Domain Adaptation and Generalization: A Study of Robustness
Shadi Alijani
Jamil Fayyad
Homayoun Najjaran
OOD
314
1
0
05 Apr 2024
Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)
Michael Stephen Saxon
Fatima Jahara
Mahsa Khoshnoodi
Yujie Lu
Aditya Sharma
William Y. Wang
EGVM
304
12
0
05 Apr 2024
Would Deep Generative Models Amplify Bias in Future Models?
Computer Vision and Pattern Recognition (CVPR), 2024
Tianwei Chen
Yusuke Hirota
Mayu Otani
Noa Garcia
Yuta Nakashima
216
23
0
04 Apr 2024
Text-driven Affordance Learning from Egocentric Vision
Tomoya Yoshida
Shuhei Kurita
Taichi Nishimura
Shinsuke Mori
286
6
0
03 Apr 2024
Rethinking Pruning for Vision-Language Models: Strategies for Effective Sparsity and Performance Restoration
Shwai He
Ang Li
Tianlong Chen
VLM
285
3
0
03 Apr 2024
CosmicMan: A Text-to-Image Foundation Model for Humans
Shikai Li
Jianglin Fu
Kaiyuan Liu
Wentao Wang
Kwan-Yee Lin
Wayne Wu
DiffM
284
34
0
01 Apr 2024
Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models
Jesse Atuhurra
Iqra Ali
Tatsuya Hiraoka
Hidetaka Kamigaito
Tomoya Iwakura
Taro Watanabe
244
1
0
29 Mar 2024
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Weifeng Lin
Xinyu Wei
Ruichuan An
Shiyang Feng
Bocheng Zou
Yulin Luo
Siyuan Huang
Shanghang Zhang
Jiaming Song
VLM
402
84
0
29 Mar 2024
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
Kai Zhang
Yi Luan
Hexiang Hu
Kenton Lee
Siyuan Qiao
Wenhu Chen
Yu-Chuan Su
Ming-Wei Chang
VLM, LRM
307
77
0
28 Mar 2024
J-CRe3: A Japanese Conversation Dataset for Real-world Reference Resolution
Nobuhiro Ueda
Hideko Habe
Yoko Matsui
Akishige Yuguchi
Seiya Kawano
Yasutomo Kawanishi
Sadao Kurohashi
Koichiro Yoshino
185
8
0
28 Mar 2024
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
Hao Shao
Shengju Qian
Han Xiao
Guanglu Song
Zhuofan Zong
Letian Wang
Yu Liu
Jiaming Song
VGen, LRM, MLLM
346
216
0
25 Mar 2024
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
European Conference on Computer Vision (ECCV), 2024
Yi Wang
Kunchang Li
Xinhao Li
Jiashuo Yu
Yinan He
...
Hongjie Zhang
Yifei Huang
Yu Qiao
Yali Wang
Limin Wang
262
104
0
22 Mar 2024
Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models
Qiong Wu
Yiyi Zhou
Weihao Ye
Xiaoshuai Sun
Rongrong Ji
MoE
188
2
0
22 Mar 2024
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery
Guan-Feng Wang
Long Bai
Wan Jun Nah
Jie Wang
Zhaoxi Zhang
Zhen Chen
Jinlin Wu
Mobarakol Islam
Hongbin Liu
Hongliang Ren
350
28
0
22 Mar 2024
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
Zheng Zhang
Yeyao Ma
Enming Zhang
Xiang Bai
VLM, MLLM
285
79
0
21 Mar 2024
SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models
Tongtian Yue
Jie Cheng
Longteng Guo
Xingyuan Dai
Zijia Zhao
Xingjian He
Gang Xiong
Yisheng Lv
Jing Liu
216
13
0
20 Mar 2024
WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar
Runwei Guan
Liye Jia
Fengyufan Yang
Shanliang Yao
Erick Purwanto
...
Eng Gee Lim
Jeremy S. Smith
Ka Lok Man
Xuming Hu
Yutao Yue
378
19
0
19 Mar 2024
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory
Sensen Gao
Yang Liu
Xuhong Ren
Ivor Tsang
Qing Guo
AAML
337
30
0
19 Mar 2024
A Survey on Quality Metrics for Text-to-Image Generation
IEEE Transactions on Visualization and Computer Graphics (TVCG), 2024
Sebastian Hartwig
Dominik Engel
Leon Sick
H. Kniesel
Tristan Payer
Poonam Poonam
Michael Glockler
Alex Bauerle
Timo Ropinski
EGVM
298
0
0
18 Mar 2024
Improving Adversarial Transferability of Vision-Language Pre-training Models through Collaborative Multimodal Interaction
Jiyuan Fu
Zhaoyu Chen
Kaixun Jiang
Haijing Guo
Jiafeng Wang
Shuyong Gao
Wenqiang Zhang
VLM, AAML
171
7
0
16 Mar 2024
Mitigating Dialogue Hallucination for Large Vision Language Models via Adversarial Instruction Tuning
Dongmin Park
Zhaofang Qian
Guangxing Han
Ser-Nam Lim
MLLM
261
1
0
15 Mar 2024
GiT: Towards Generalist Vision Transformer through Universal Language Interface
European Conference on Computer Vision (ECCV), 2024
Haiyang Wang
Hao Tang
Li Jiang
Shaoshuai Shi
Muhammad Ferjad Naeem
Jiaming Song
Bernt Schiele
Liwei Wang
VLM
280
22
0
14 Mar 2024
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan
Yousong Zhu
Hongyin Zhao
Fan Yang
Jinqiao Wang
ObjD
294
26
0
14 Mar 2024
MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
Computer Vision and Pattern Recognition (CVPR), 2024
Haokun Lin
Haoli Bai
Zhili Liu
Lu Hou
Muyi Sun
Linqi Song
Ying Wei
Zhenan Sun
CLIP, VLM
165
35
0
12 Mar 2024
Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings
Sahand Sharifzadeh
Christos Kaplanis
Shreya Pathak
D. Kumaran
Anastasija Ilić
Jovana Mitrović
Charles Blundell
Andrea Banino
VLM
238
17
0
12 Mar 2024
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
European Conference on Computer Vision (ECCV), 2024
Liang Chen
Haozhe Zhao
Tianyu Liu
Shuai Bai
Junyang Lin
Chang Zhou
Baobao Chang
MLLM, VLM
342
333
0
11 Mar 2024
VLM-PL: Advanced Pseudo Labeling Approach for Class Incremental Object Detection via Vision-Language Model
Junsu Kim
Yunhoe Ku
Jihyeon Kim
Junuk Cha
Seungryul Baek
ObjD, VLM
330
24
0
08 Mar 2024
Effectiveness Assessment of Recent Large Vision-Language Models
Yao Jiang
Xinyu Yan
Ge-Peng Ji
Keren Fu
Meijun Sun
Huan Xiong
Deng-Ping Fan
Fahad Shahbaz Khan
535
36
0
07 Mar 2024
MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding
Chun-Peng Chang
Shaoxiang Wang
A. Pagani
Didier Stricker
360
23
0
05 Mar 2024
Detecting Concrete Visual Tokens for Multimodal Machine Translation
Braeden Bowen
Vipin Vijayan
Scott Grigsby
Timothy Anderson
Jeremy Gwinnup
260
5
0
05 Mar 2024
Adding Multimodal Capabilities to a Text-only Translation Model
Vipin Vijayan
Braeden Bowen
Scott Grigsby
Timothy Anderson
Jeremy Gwinnup
LRM
276
10
0
05 Mar 2024
Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception
Jun-Yan He
Yifan Wang
Lijun Wang
Huchuan Lu
Jun-Yan He
Jinpeng Lan
Bin Luo
Xuansong Xie
MLLM, VLM
229
35
0
05 Mar 2024
Differentially Private Representation Learning via Image Captioning
Tom Sander
Yaodong Yu
Maziar Sanjabi
Alain Durmus
Yi-An Ma
Kamalika Chaudhuri
Chuan Guo
279
7
0
04 Mar 2024
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
David Wan
Jaemin Cho
Elias Stengel-Eskin
Mohit Bansal
VLM, ObjD
316
49
0
04 Mar 2024
Regeneration Based Training-free Attribution of Fake Images Generated by Text-to-Image Generative Models
Meiling Li
Zhenxing Qian
Xinpeng Zhang
295
2
0
03 Mar 2024
Can Transformers Capture Spatial Relations between Objects?
Chuan Wen
Dinesh Jayaraman
Yang Gao
ViT
178
8
0
01 Mar 2024
Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset
Ander Salaberria
Gorka Azkune
Oier López de Lacalle
A. Soroa
Eneko Agirre
Frank Keller
EGVM
199
3
0
01 Mar 2024
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Weiyun Wang
Yiming Ren
Hao Luo
Tiantong Li
Chenxiang Yan
...
Qingyun Li
Lewei Lu
Xizhou Zhu
Yu Qiao
Jifeng Dai
MLLM
318
85
0
29 Feb 2024
How to Understand "Support"? An Implicit-enhanced Causal Inference Approach for Weakly-supervised Phrase Grounding
Jiamin Luo
Jianing Zhao
Jingjing Wang
Guodong Zhou
234
0
0
29 Feb 2024
Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control
Thong Nguyen
Mariya Hendriksen
Andrew Yates
Maarten de Rijke
190
15
0
27 Feb 2024
Probing Multimodal Large Language Models for Global and Local Semantic Representations
Mingxu Tao
Quzhe Huang
Kun Xu
Liwei Chen
Yansong Feng
Dongyan Zhao
322
10
0
27 Feb 2024
Measuring Vision-Language STEM Skills of Neural Models
Jianhao Shen
Ye Yuan
Srbuhi Mirzoyan
Ming Zhang
Chenguang Wang
VLM
426
13
0
27 Feb 2024
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
Yichi Zhang
Ziqiao Ma
Xiaofeng Gao
Suhaila Shakiah
Qiaozi Gao
Joyce Chai
MLLM, VLM
376
74
0
26 Feb 2024