
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015

Bryan A. Plummer

Liwei Wang

Christopher M. Cervantes

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,325 papers shown

Fine-tuning CLIP Text Encoders with Two-step Paraphrasing

238

23 Feb 2024

Uncertainty-Aware Evaluation for Vision-Language Models

440

22 Feb 2024

CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples

390

20 Feb 2024

A Survey on Knowledge Distillation of Large Language Models

464

235

20 Feb 2024

Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions

542

20 Feb 2024

Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models

Matthias Hein

389

19 Feb 2024

CIC: A Framework for Culturally-Aware Image Captioning

Youngsik Yun

Jihie Kim

VLM

416

08 Feb 2024

Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study

189

31 Jan 2024

Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis

Jianing Li

Xi Nan

Ming Lu

Li Du

Shanghang Zhang

148

31 Jan 2024

YOLO-World: Real-Time Open-Vocabulary Object Detection

Ying Shan

404

650

30 Jan 2024

PACE: A Pragmatic Agent for Enhancing Communication Efficiency Using Large Language Models

188

30 Jan 2024

M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining

Jingdong Chen

Ming Yang

VLM MLLM

220

29 Jan 2024

CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning

International Joint Conference on Artificial Intelligence (IJCAI), 2024

297

25 Jan 2024

SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

Siwei Wu

Yi Zhou

Kang Zhu

Ge Zhang

...

Noura Al Moubayed

244

24 Jan 2024

ChatterBox: Multi-round Multimodal Referring and Grounding

213

24 Jan 2024

Prompting Large Vision-Language Models for Compositional Reasoning

235

20 Jan 2024

Supervised Fine-tuning in turn Improves Visual Foundation Models

Chun Yuan

Ying Shan

VLM CLIP

250

18 Jan 2024

MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer

...

Yu Qiao

240

18 Jan 2024

Improving fine-grained understanding in image-text pre-training

...

220

18 Jan 2024

COCO is "ALL'' You Need for Visual Instruction Fine-tuning

IEEE International Conference on Multimedia and Expo (ICME), 2024

Hongxia Yang

210

17 Jan 2024

KTVIC: A Vietnamese Image Captioning Dataset on the Life Domain

265

16 Jan 2024

GroundingGPT:Language Enhanced Multi-modal Grounding Model

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

...

621

11 Jan 2024

Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

Fei Huang

191

11 Jan 2024

Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning

Jianbo Yuan

Hongxia Yang

315

146

10 Jan 2024

CaMML: Context-Aware Multimodal Learner for Large Models

277

06 Jan 2024

Object-oriented backdoor attack against image captioning

Sheng Li

173

05 Jan 2024

Towards Weakly Supervised Text-to-Audio Grounding

Kai Yu

356

05 Jan 2024

An Open and Comprehensive Pipeline for Unified Object Grounding and Detection

Xiangtai Li

309

04 Jan 2024

Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training

212

04 Jan 2024

SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment

International Conference on Machine Learning (ICML), 2024

Ming Yang

233

04 Jan 2024

GPT-4V(ision) is a Generalist Web Agent, if Grounded

International Conference on Machine Learning (ICML), 2024

Huan Sun

385

407

03 Jan 2024

BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving

AAAI Conference on Artificial Intelligence (AAAI), 2024

...

Xiaodan Liang

217

02 Jan 2024

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Kevin Lin

208

01 Jan 2024

Generating Enhanced Negatives for Training Language-Based Object Detectors

Computer Vision and Pattern Recognition (CVPR), 2023

453

29 Dec 2023

Bridging Modality Gap for Visual Grounding with Effecitve Cross-modal Distillation

Chinese Conference on Pattern Recognition and Computer Vision (CPRCV), 2023

281

29 Dec 2023

Video Understanding with Large Language Models: A Survey

...

717

170

29 Dec 2023

MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices

...

Chunhua Shen

312

28 Dec 2023

Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey

198

27 Dec 2023

Cycle-Consistency Learning for Captioning and Grounding

235

23 Dec 2023

GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

420

22 Dec 2023

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Weijie Su

...

Ping Luo

Yu Qiao

641

2,182

21 Dec 2023

Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pretraining

255

19 Dec 2023

Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

Ser-Nam Lim

386

19 Dec 2023

Context Disentangling and Prototype Inheriting for Robust Visual Grounding

Wei Tang

271

19 Dec 2023

Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance

474

101

17 Dec 2023

Pixel Aligned Language Models

Computer Vision and Pattern Recognition (CVPR), 2023

291

14 Dec 2023

TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training

AAAI Conference on Artificial Intelligence (AAAI), 2023

Ji Zhang

326

14 Dec 2023

Exploration of visual prompt in Grounded pre-trained open-set detection

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023

121

14 Dec 2023

Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator

460

11 Dec 2023

MAFA: Managing False Negatives for Vision-Language Pre-training

414

11 Dec 2023