Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,325 papers shown
MetaReVision: Meta-Learning with Retrieval for Visually Grounded Compositional Concept Acquisition
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Guangyue Xu
Parisa Kordjamshidi
Joyce Chai
162
2
0
02 Nov 2023
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
International Conference on Computational Linguistics (COLING), 2023
Yifan Du
Hangyu Guo
Kun Zhou
Wayne Xin Zhao
Jinpeng Wang
Chuyuan Wang
Mingchen Cai
Ruihua Song
Ji-Rong Wen
VLM, MLLM, LRM
524
28
0
02 Nov 2023
CapsFusion: Rethinking Image-Text Data at Scale
Computer Vision and Pattern Recognition (CVPR), 2023
Qiying Yu
Quan-Sen Sun
Xiaosong Zhang
Yufeng Cui
Fan Zhang
Yue Cao
Xinlong Wang
Jingjing Liu
VLM
371
88
0
31 Oct 2023
MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval
Youbo Lei
Feifei He
Chen Chen
Yingbin Mo
Sijia Li
Defeng Xie
H. Lu
VLM
369
2
0
30 Oct 2023
Women Wearing Lipstick: Measuring the Bias Between an Object and Its Related Gender
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Ahmed Sabir
Lluís Padró
353
3
0
29 Oct 2023
CityRefer: Geography-aware 3D Visual Grounding Dataset on City-scale Point Cloud Data
Neural Information Processing Systems (NeurIPS), 2023
Taiki Miyanishi
Fumiya Kitamori
Shuhei Kurita
Jungdae Lee
M. Kawanabe
Nakamasa Inoue
AI4TS, 3DPC
227
15
0
28 Oct 2023
GROOViST: A Metric for Grounding Objects in Visual Storytelling
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Aditya K Surikuchi
Sandro Pezzelle
Raquel Fernández
152
14
0
26 Oct 2023
Evaluating Bias and Fairness in Gender-Neutral Pretrained Vision-and-Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Laura Cabello
Emanuele Bugliarello
Stephanie Brandl
Desmond Elliott
279
8
0
26 Oct 2023
RIO: A Benchmark for Reasoning Intention-Oriented Objects in Open Environments
Neural Information Processing Systems (NeurIPS), 2023
Mengxue Qu
Yu-Huan Wu
Wu Liu
Xiaodan Liang
Jingkuan Song
Yao-Min Zhao
Yunchao Wei
243
19
0
26 Oct 2023
Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network
Industrial Conference on Data Mining (IDM), 2023
Yiming Lin
Xiao-Bo Jin
Qiufeng Wang
Kaizhu Huang
159
5
0
25 Oct 2023
Video Referring Expression Comprehension via Transformer with Content-conditioned Query
Jiang Ji
Meng Cao
Tengtao Song
Long Chen
Yi Wang
Yuexian Zou
273
6
0
25 Oct 2023
TiC-CLIP: Continual Training of CLIP Models
International Conference on Learning Representations (ICLR), 2023
Saurabh Garg
Mehrdad Farajtabar
Hadi Pouransari
Raviteja Vemulapalli
Sachin Mehta
Oncel Tuzel
Vaishaal Shankar
Fartash Faghri
VLM, CLIP
361
40
0
24 Oct 2023
Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Te-Lin Wu
Yu Zhou
Nanyun Peng
194
10
0
23 Oct 2023
Open-Set Image Tagging with Multi-Grained Text Supervision
Xinyu Huang
Yi-Jie Huang
Youcai Zhang
Weiwei Tian
Rui Feng
Yuejie Zhang
Yanchun Xie
Yaqian Li
Lei Zhang
VLM
249
64
0
23 Oct 2023
OV-VG: A Benchmark for Open-Vocabulary Visual Grounding
Chunlei Wang
Wenquan Feng
Xiangtai Li
Guangliang Cheng
Shuchang Lyu
Binghao Liu
Lijiang Chen
Qi Zhao
ObjD, VLM
278
14
0
22 Oct 2023
ITEm: Unsupervised Image-Text Embedding Learning for eCommerce
Baohao Liao
Michael Kozielski
Sanjika Hewavitharana
Jiangbo Yuan
Shahram Khadivi
Tomer Lancewicki
SSL
132
0
0
22 Oct 2023
On the Transferability of Visually Grounded PCFGs
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yanpeng Zhao
Ivan Titov
147
1
0
21 Oct 2023
CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages
G. O. D. Santos
Diego A. B. Moreira
Alef Iury Ferreira
Jhessica Silva
Luiz Pereira
...
H. Maia
Nádia Da Silva
Esther Colombini
Hélio Pedrini
Sandra Avila
VLM, CLIP
193
7
0
20 Oct 2023
Semi-supervised multimodal coreference resolution in image narrations
A. Goel
Basura Fernando
Frank Keller
Hakan Bilen
218
6
0
20 Oct 2023
Multiscale Superpixel Structured Difference Graph Convolutional Network for VL Representation
Siyu Zhang
Ye-Ting Chen
Fang Wang
Yaoru Sun
Jun Yang
Lizhi Bai
SSL
299
1
0
20 Oct 2023
InvGC: Robust Cross-Modal Retrieval by Inverse Graph Convolution
Xiangru Jian
Yimu Wang
255
6
0
20 Oct 2023
On the Language Encoder of Contrastive Cross-modal Models
Mengjie Zhao
Junya Ono
Zhi-Wei Zhong
Chieh-Hsin Lai
Yuhta Takida
Naoki Murata
Wei-Hsiang Liao
Takashi Shibuya
Hiromi Wakaki
Yuki Mitsufuji
VLM
156
2
0
20 Oct 2023
Frozen Transformers in Language Models Are Effective Visual Encoder Layers
Ziqi Pang
Ziyang Xie
Yunze Man
Yu-Xiong Wang
431
49
0
19 Oct 2023
Evaluating the Fairness of Discriminative Foundation Models in Computer Vision
AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2023
Junaid Ali
Matthäus Kleindessner
F. Wenzel
Kailash Budhathoki
Volkan Cevher
Chris Russell
VLM
248
15
0
18 Oct 2023
Balance Act: Mitigating Hubness in Cross-Modal Retrieval with Query and Gallery Banks
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yimu Wang
Xiangru Jian
Bo Xue
204
22
0
17 Oct 2023
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Jianwei Yang
Hao Zhang
Feng Li
Xueyan Zou
Chun-yue Li
Jianfeng Gao
MLLM, VLM
447
269
0
17 Oct 2023
NICE: Improving Panoptic Narrative Detection and Segmentation with Cascading Collaborative Learning
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Haowei Wang
Jiayi Ji
Tianyu Guo
Yilong Yang
Weihao Ye
Xiaoshuai Sun
Rongrong Ji
353
8
0
17 Oct 2023
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Jun Chen
Deyao Zhu
Xiaoqian Shen
Xiang Li
Zechun Liu
Pengchuan Zhang
Raghuraman Krishnamoorthi
Vikas Chandra
Yunyang Xiong
Mohamed Elhoseiny
MLLM
1.5K
631
0
14 Oct 2023
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models
Dongsheng Jiang
Yuchen Liu
Songlin Liu
Jiné Zhao
Hao Zhang
Zhen Gao
Xiaopeng Zhang
Jin Li
Hongkai Xiong
MLLM, VLM
412
72
0
13 Oct 2023
Incremental Object Detection with CLIP
Ziyue Huang
Yupeng He
Qingjie Liu
Yunhong Wang
CLL, ObjD, VLM
299
2
0
13 Oct 2023
Ferret: Refer and Ground Anything Anywhere at Any Granularity
International Conference on Learning Representations (ICLR), 2023
Haoxuan You
Haotian Zhang
Zhe Gan
Xianzhi Du
Bowen Zhang
Zirui Wang
Liangliang Cao
Shih-Fu Chang
Yinfei Yang
ObjD, MLLM, VLM
421
455
0
11 Oct 2023
VeCLIP: Improving CLIP Training via Visual-enriched Captions
European Conference on Computer Vision (ECCV), 2023
Zhengfeng Lai
Haotian Zhang
Bowen Zhang
Wentao Wu
Haoping Bai
...
Zhe Gan
Jiulong Shan
Chen-Nee Chuah
Yinfei Yang
Meng Cao
CLIP, VLM
365
60
0
11 Oct 2023
TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
IEEE International Conference on Computer Vision (ICCV), 2023
Chengyang Zhao
Songlin Yang
Zhenfang Chen
Mingyu Ding
Chuang Gan
393
23
0
10 Oct 2023
InstructDET: Diversifying Referring Object Detection with Generalized Instructions
International Conference on Learning Representations (ICLR), 2023
Ronghao Dang
Jiangyan Feng
Haodong Zhang
Chongjian Ge
Lin Song
...
Chengju Liu
Qi Chen
Feng Zhu
Rui Zhao
Yibing Song
ObjD
441
16
0
08 Oct 2023
Lightweight In-Context Tuning for Multimodal Unified Models
Yixin Chen
Shuai Zhang
Boran Han
Jiaya Jia
144
5
0
08 Oct 2023
Envisioning Narrative Intelligence: A Creative Visual Storytelling Anthology
International Conference on Human Factors in Computing Systems (CHI), 2023
Brett A. Halperin
S. Lukin
CoGe
214
30
0
06 Oct 2023
ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models
International Conference on Learning Representations (ICLR), 2023
Yi-Lin Sung
Jaehong Yoon
Mohit Bansal
VLM
282
20
0
04 Oct 2023
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
International Conference on Learning Representations (ICLR), 2023
Size Wu
Wenwei Zhang
Lumin Xu
Sheng Jin
Xiangtai Li
Wentao Liu
Chen Change Loy
CLIP, VLM
250
104
0
02 Oct 2023
Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association
Qiyu Wu
Mengjie Zhao
Yutong He
Lang Huang
Junya Ono
Hiromi Wakaki
Yuki Mitsufuji
298
6
0
02 Oct 2023
Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP
International Conference on Learning Representations (ICLR), 2023
Zixiang Chen
Yihe Deng
Yuanzhi Li
Quanquan Gu
VLM
395
18
0
02 Oct 2023
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
Computer Vision and Pattern Recognition (CVPR), 2023
Shiyu Xuan
Qingpei Guo
Ming Yang
Shiliang Zhang
MLLM, ObjD
270
52
0
01 Oct 2023
Black-box Attacks on Image Activity Prediction and its Natural Language Explanations
Alina Elena Baia
Valentina Poggioni
Andrea Cavallaro
AAML
227
1
0
30 Sep 2023
Region-centric Image-Language Pretraining for Open-Vocabulary Detection
European Conference on Computer Vision (ECCV), 2023
Dahun Kim
A. Angelova
Weicheng Kuo
ObjD, VLM
257
6
0
29 Sep 2023
Retail-786k: a Large-Scale Dataset for Visual Entity Matching
Bianca Lamm
Janis Keuper
VLM
243
4
0
29 Sep 2023
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai-Nguyen Nguyen
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
328
22
0
23 Sep 2023
TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance
IEEE International Conference on Computer Vision (ICCV), 2023
Kan Wu
Houwen Peng
Zhenghong Zhou
Bin Xiao
Xiyang Dai
...
Xi Chen
Xinggang Wang
Hongyang Chao
Han Hu
VLM, OODD
257
97
0
21 Sep 2023
Multi3DRefer: Grounding Text Description to Multiple 3D Objects
IEEE International Conference on Computer Vision (ICCV), 2023
Yiming Zhang
ZeMing Gong
Angel X. Chang
397
134
0
11 Sep 2023
Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning
International Conference on Language Resources and Evaluation (LREC), 2023
Guisheng Liu
Yi Li
Zhengcong Fei
Haiyan Fu
Xiangyang Luo
Yanqing Guo
VLM, DiffM
265
16
0
10 Sep 2023
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
International Conference on Learning Representations (ICLR), 2023
Yang Jin
Kun Xu
Kun Xu
Liwei Chen
Chao Liao
...
Xiaoqiang Lei
Chen Zhang
Wenwu Ou
Kun Gai
Yadong Mu
MLLM, VLM
257
77
0
09 Sep 2023
Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding
European Conference on Computer Vision (ECCV), 2023
Ozan Unal
Daniel Gehrig
Suman Saha
Luc Van Gool
288
29
0
08 Sep 2023