
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015

Bryan A. Plummer

Liwei Wang

Christopher M. Cervantes

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,325 papers shown

MetaReVision: Meta-Learning with Retrieval for Visually Grounded Compositional Concept Acquisition

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Guangyue Xu

Parisa Kordjamshidi

Joyce Chai

162

02 Nov 2023

What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning

International Conference on Computational Linguistics (COLING), 2023

524

02 Nov 2023

CapsFusion: Rethinking Image-Text Data at Scale

Computer Vision and Pattern Recognition (CVPR), 2023

371

31 Oct 2023

MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval

369

30 Oct 2023

Women Wearing Lipstick: Measuring the Bias Between an Object and Its Related Gender

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Ahmed Sabir

Lluís Padró

353

29 Oct 2023

CityRefer: Geography-aware 3D Visual Grounding Dataset on City-scale Point Cloud Data

Neural Information Processing Systems (NeurIPS), 2023

227

28 Oct 2023

GROOViST: A Metric for Grounding Objects in Visual Storytelling

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Aditya K Surikuchi

Sandro Pezzelle

Raquel Fernández

152

26 Oct 2023

Evaluating Bias and Fairness in Gender-Neutral Pretrained Vision-and-Language Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

279

26 Oct 2023

RIO: A Benchmark for Reasoning Intention-Oriented Objects in Open Environments

Neural Information Processing Systems (NeurIPS), 2023

Jingkuan Song

243

26 Oct 2023

Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network

Industrial Conference on Data Mining (IDM), 2023

Yiming Lin

Xiao-Bo Jin

Qiufeng Wang

Kaizhu Huang

159

25 Oct 2023

Video Referring Expression Comprehension via Transformer with Content-conditioned Query

273

25 Oct 2023

TiC-CLIP: Continual Training of CLIP Models

International Conference on Learning Representations (ICLR), 2023

361

24 Oct 2023

Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Te-Lin Wu

Yu Zhou

Nanyun Peng

194

23 Oct 2023

Open-Set Image Tagging with Multi-Grained Text Supervision

Xinyu Huang

Yi-Jie Huang

Youcai Zhang

Weiwei Tian

Rui Feng

Lei Zhang

249

23 Oct 2023

OV-VG: A Benchmark for Open-Vocabulary Visual Grounding

Xiangtai Li

278

22 Oct 2023

ITEm: Unsupervised Image-Text Embedding Learning for eCommerce

Baohao Liao

Michael Kozielski

Sanjika Hewavitharana

132

22 Oct 2023

On the Transferability of Visually Grounded PCFGs

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Yanpeng Zhao

Ivan Titov

147

21 Oct 2023

CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages

...

Esther Colombini

193

20 Oct 2023

Semi-supervised multimodal coreference resolution in image narrations

218

20 Oct 2023

Multiscale Superpixel Structured Difference Graph Convolutional Network for VL Representation

299

20 Oct 2023

InvGC: Robust Cross-Modal Retrieval by Inverse Graph Convolution

Xiangru Jian

Yimu Wang

255

20 Oct 2023

On the Language Encoder of Contrastive Cross-modal Models

156

20 Oct 2023

Frozen Transformers in Language Models Are Effective Visual Encoder Layers

431

19 Oct 2023

Evaluating the Fairness of Discriminative Foundation Models in Computer Vision

AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2023

Junaid Ali

Matthäus Kleindessner

248

18 Oct 2023

Balance Act: Mitigating Hubness in Cross-Modal Retrieval with Query and Gallery Banks

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Yimu Wang

Xiangru Jian

Bo Xue

204

17 Oct 2023

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Jianwei Yang

447

269

17 Oct 2023

NICE: Improving Panoptic Narrative Detection and Segmentation with Cascading Collaborative Learning

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Jiayi Ji

353

17 Oct 2023

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

Raghuraman Krishnamoorthi

1.5K

631

14 Oct 2023

From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models

412

13 Oct 2023

Incremental Object Detection with CLIP

Ziyue Huang

Yupeng He

Qingjie Liu

Yunhong Wang

CLL ObjD VLM

299

13 Oct 2023

Ferret: Refer and Ground Anything Anywhere at Any Granularity

International Conference on Learning Representations (ICLR), 2023

Xianzhi Du

421

455

11 Oct 2023

VeCLIP: Improving CLIP Training via Visual-enriched Captions

European Conference on Computer Vision (ECCV), 2023

...

365

11 Oct 2023

TextPSG: Panoptic Scene Graph Generation from Textual Descriptions

IEEE International Conference on Computer Vision (ICCV), 2023

Chengyang Zhao

Songlin Yang

Zhenfang Chen

Mingyu Ding

Chuang Gan

393

10 Oct 2023

InstructDET: Diversifying Referring Object Detection with Generalized Instructions

International Conference on Learning Representations (ICLR), 2023

...

441

08 Oct 2023

Lightweight In-Context Tuning for Multimodal Unified Models

144

08 Oct 2023

Envisioning Narrative Intelligence: A Creative Visual Storytelling Anthology

International Conference on Human Factors in Computing Systems (CHI), 2023

Brett A. Halperin

S. Lukin

CoGe

214

06 Oct 2023

ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models

International Conference on Learning Representations (ICLR), 2023

Yi-Lin Sung

Jaehong Yoon

Mohit Bansal

VLM

282

04 Oct 2023

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

International Conference on Learning Representations (ICLR), 2023

Xiangtai Li

Wentao Liu

Chen Change Loy

CLIP VLM

250

104

02 Oct 2023

Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association

298

02 Oct 2023

Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP

International Conference on Learning Representations (ICLR), 2023

Quanquan Gu

395

02 Oct 2023

Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs

Computer Vision and Pattern Recognition (CVPR), 2023

Ming Yang

270

01 Oct 2023

Black-box Attacks on Image Activity Prediction and its Natural Language Explanations

227

30 Sep 2023

Region-centric Image-Language Pretraining for Open-Vocabulary Detection

European Conference on Computer Vision (ECCV), 2023

257

29 Sep 2023

Retail-786k: a Large-Scale Dataset for Visual Entity Matching

Bianca Lamm

Janis Keuper

VLM

243

29 Sep 2023

A Survey on Image-text Multimodal Models

Ruifeng Guo

Jingxuan Wei

Linzhuang Sun

Khai-Nguyen Nguyen

Guiyong Chang

Dawei Liu

Sibo Zhang

Zhengbing Yao

Mingjun Xu

Liping Bu

VLM

328

23 Sep 2023

TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance

IEEE International Conference on Computer Vision (ICCV), 2023

...

257

21 Sep 2023

Multi3DRefer: Grounding Text Description to Multiple 3D Objects

IEEE International Conference on Computer Vision (ICCV), 2023

Yiming Zhang

ZeMing Gong

Angel X. Chang

397

134

11 Sep 2023

Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning

International Conference on Language Resources and Evaluation (LREC), 2023

Zhengcong Fei

265

10 Sep 2023

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

International Conference on Learning Representations (ICLR), 2023

Kun Xu

...

257

09 Sep 2023

Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

European Conference on Computer Vision (ECCV), 2023

Ozan Unal

Daniel Gehrig

Suman Saha

Luc Van Gool

288

08 Sep 2023