Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015

Bryan A. Plummer

Liwei Wang

Christopher M. Cervantes

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,325 papers shown

Contrastive Learning of Sentence Embeddings from Scratch

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

376

24 May 2023

SmartTrim: Adaptive Tokens and Attention Pruning for Efficient Vision-Language Models

International Conference on Language Resources and Evaluation (LREC), 2023

236

24 May 2023

Meta-learning For Vision-and-language Cross-lingual Transfer

Hanxu Hu

Frank Keller

VLM

156

24 May 2023

GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions

Ahmed Hassan Awadallah

Damien Jose

Xiang Ren

ObjD VLM

237

24 May 2023

Mitigating Test-Time Bias for Fair Image Retrieval

Neural Information Processing Systems (NeurIPS), 2023

Fanjie Kong

Shuai Yuan

Weituo Hao

Ricardo Henao

199

23 May 2023

ReadMe++: Benchmarking Multilingual Language Models for Multi-Domain Readability Assessment

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

312

23 May 2023

VisorGPT: Learning Visual Prior via Generative Pre-Training

Neural Information Processing Systems (NeurIPS), 2023

782

23 May 2023

UNIMO-3: Multi-granularity Interaction for Vision-Language Representation Learning

114

23 May 2023

RaSa: Relation and Sensitivity Aware Representation Learning for Text-based Person Search

International Joint Conference on Artificial Intelligence (IJCAI), 2023

Min Zhang

262

105

23 May 2023

EDIS: Entity-Driven Image Search over Multimodal Web Content

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

337

23 May 2023

Type-to-Track: Retrieve Any Object via Prompt-based Tracking

Neural Information Processing Systems (NeurIPS), 2023

285

22 May 2023

DiffCap: Exploring Continuous Diffusion on Image Captioning

Zefan Cai

205

20 May 2023

Not All Semantics are Created Equal: Contrastive Self-supervised Learning with Automatic Temperature Individualization

International Conference on Machine Learning (ICML), 2023

257

19 May 2023

Going Denser with Open-Vocabulary Part Segmentation

IEEE International Conference on Computer Vision (ICCV), 2023

Ping Luo

235

18 May 2023

Weakly-Supervised Visual-Textual Grounding with Semantic Prior Refinement

British Machine Vision Conference (BMVC), 2023

192

18 May 2023

Iterative Adversarial Attack on Image-guided Story Ending Generation

IEEE Transactions on Multimedia (IEEE TMM), 2023

Youze Wang

Wenbo Hu

Richang Hong

248

16 May 2023

CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding

IEEE Transactions on Multimedia (IEEE TMM), 2023

Linhui Xiao

Xiaoshan Yang

Fang Peng

Ming Yan

Yaowei Wang

Changsheng Xu

ObjD VLM

472

15 May 2023

Parameter-efficient Tuning of Large-scale Multimodal Foundation Model

Neural Information Processing Systems (NeurIPS), 2023

Xiao Luo

301

15 May 2023

A Comprehensive Survey on Segment Anything Model for Vision and Beyond

421

131

14 May 2023

Measuring Progress in Fine-grained Vision-and-Language Understanding

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

232

12 May 2023

Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

Computer Vision and Pattern Recognition (CVPR), 2023

425

112

11 May 2023

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Yunxin Li

Baotian Hu

Xinyu Chen

Yuxin Ding

Lin Ma

Min Zhang

LRM

170

08 May 2023

TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

361

08 May 2023

UIT-OpenViIC: A Novel Benchmark for Evaluating Image Captioning in Vietnamese

Doanh C. Bui

Nghia Hieu Nguyen

Khang Phuoc-Quy Nguyen

VLM

242

07 May 2023

X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages

Minglun Han

Bo Xu

343

152

07 May 2023

LMEye: An Interactive Perception Network for Large Language Models

IEEE Transactions on Multimedia (IEEE TMM), 2023

Baotian Hu

Lin Ma

290

05 May 2023

ArK: Augmented Reality with Knowledge Interactive Emergent Ability

...

Yejin Choi

197

01 May 2023

An Empirical Study of Multimodal Model Merging

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

337

28 Apr 2023

Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement

425

27 Apr 2023

From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping

International Joint Conference on Artificial Intelligence (IJCAI), 2023

301

26 Apr 2023

OmniLabel: A Challenging Benchmark for Language-Based Object Detection

IEEE International Conference on Computer Vision (ICCV), 2023

S. Schulter

G. VijayKumarB.

Yumin Suh

Konstantinos M. Dafnis

187

22 Apr 2023

RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models

402

21 Apr 2023

Chain of Thought Prompt Tuning in Vision Language Models

Shanghang Zhang

290

16 Apr 2023

RECLIP: Resource-efficient CLIP by Training with Small Images

281

12 Apr 2023

MoMo: A shared encoder Model for text, image and multi-Modal representations

126

11 Apr 2023

Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following

Conference on Robot Learning (CoRL), 2023

Mingyu Ding

Yan Xu

Ping Luo

Chuang Gan

207

07 Apr 2023

DATE: Domain Adaptive Product Seeker for E-commerce

Computer Vision and Pattern Recognition (CVPR), 2023

Zhou Zhao

316

07 Apr 2023

Training-Free Layout Control with Cross-Attention Guidance

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023

Minghao Chen

Iro Laina

Andrea Vedaldi

DiffM

449

318

06 Apr 2023

Uncurated Image-Text Datasets: Shedding Light on Demographic Bias

Computer Vision and Pattern Recognition (CVPR), 2023

202

06 Apr 2023

Multi-Modal Representation Learning with Text-Driven Soft Masks

Computer Vision and Pattern Recognition (CVPR), 2023

Jaeyoo Park

Bohyung Han

SSL

185

03 Apr 2023

Self-Supervised Multimodal Learning: A Survey

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Yongshuo Zong

Oisin Mac Aodha

Timothy M. Hospedales

SSL

346

31 Mar 2023

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

...

385

29 Mar 2023

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

IEEE International Conference on Computer Vision (ICCV), 2023

Yi Wang

Yu Qiao

536

238

28 Mar 2023

Equivariant Similarity for Vision-Language Foundation Models

IEEE International Conference on Computer Vision (ICCV), 2023

Hanwang Zhang

Zicheng Liu

Lijuan Wang

CoGe

282

25 Mar 2023

CoBIT: A Contrastive Bi-directional Image-Text Generation Model

International Conference on Learning Representations (ICLR), 2023

213

23 Mar 2023

ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding

AAAI Conference on Artificial Intelligence (AAAI), 2023

Zheng Wang

174

23 Mar 2023

Top-Down Visual Attention from Analysis by Synthesis

Computer Vision and Pattern Recognition (CVPR), 2023

Baifeng Shi

Trevor Darrell

Xin Eric Wang

227

23 Mar 2023

VMCML: Video and Music Matching via Cross-Modality Lifting

164

22 Mar 2023

MAGVLT: Masked Generative Vision-and-Language Transformer

Computer Vision and Pattern Recognition (CVPR), 2023

140

21 Mar 2023

Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning

International Conference on Learning Representations (ICLR), 2023

Zaid Khan

Yun Fu

VLM

182

21 Mar 2023