
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,318 papers shown
Title
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval
Guanqi Zhan
Yuanpei Liu
Kai Han
Weidi Xie
Andrew Zisserman
VLM
1.0K
0
0
21 Feb 2025
Enhancing Adversarial Robustness of Vision-Language Models through Low-Rank Adaptation
International Conference on Multimedia Retrieval (ICMR), 2024
Yuheng Ji
Yue Liu
Zhicheng Zhang
Zhao Zhang
Yuting Zhao
Gang Zhou
Xingwei Zhang
Xinwang Liu
Xiaolong Zheng
VLM
334
4
0
21 Feb 2025
LOVA3: Learning to Visual Question Answering, Asking and Assessment
Neural Information Processing Systems (NeurIPS), 2024
Henry Hengyuan Zhao
Pan Zhou
Difei Gao
Zechen Bai
Mike Zheng Shou
358
13
0
21 Feb 2025
Contrastive Localized Language-Image Pre-Training
Hong-You Chen
Zhengfeng Lai
Hao Zhang
Xiang Wang
Marcin Eichner
Keen You
Meng Cao
Bowen Zhang
Yue Yang
Zhe Gan
CLIP, VLM
271
22
0
20 Feb 2025
InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback
Henry Hengyuan Zhao
Wenqi Pei
Yifei Tao
Haiyang Mei
Mike Zheng Shou
388
0
0
20 Feb 2025
Megrez-Omni Technical Report
Boxun Li
Yadong Li
Hui Yuan
Congyi Liu
Weilin Liu
...
Dong Zhou
Yueqing Zhuang
Shengen Yan
Guohao Dai
Longji Xu
179
1
0
19 Feb 2025
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
L. Yang
Xinchen Zhang
Ye Tian
Chenming Shang
Minghao Xu
Wentao Zhang
Tengjiao Wang
296
9
0
17 Feb 2025
How Blind and Low-Vision Individuals Prefer Large Vision-Language Model-Generated Scene Descriptions
Na Min An
Eunki Kim
Wan Ju Kang
Sangryul Kim
Hyunjung Shim
240
2
0
15 Feb 2025
Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach
Timo Fudala
Vasileios Tsouvalas
N. Meratnia
MoE
204
0
0
10 Feb 2025
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
International Conference on Learning Representations (ICLR), 2025
Marco Mistretta
Alberto Baldrati
Lorenzo Agnolucci
Marco Bertini
Andrew D. Bagdanov
CLIP, VLM
387
14
0
06 Feb 2025
Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models
H. Malik
Fahad Shamshad
Muzammal Naseer
Karthik Nandakumar
Fahad Shahbaz Khan
Salman Khan
AAML, MLLM, VLM
378
5
0
03 Feb 2025
LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
Shenghao Fu
Q. Yang
Qijie Mo
Junkai Yan
Xihan Wei
Jingke Meng
Xiaohua Xie
Wei-Shi Zheng
MLLM, ObjD, VLM
335
27
0
31 Jan 2025
Fine Tuning without Catastrophic Forgetting via Selective Low Rank Adaptation
Reza Akbarian Bafghi
Carden Bagwell
Avinash Ravichandran
Ashish Shrivastava
M. Raissi
204
4
0
28 Jan 2025
Grounding Text-to-Image Diffusion Models for Controlled High-Quality Image Generation
Ahmad Süleyman
Göksel Biricik
343
3
0
15 Jan 2025
OneLLM: One Framework to Align All Modalities with Language
Computer Vision and Pattern Recognition (CVPR), 2023
Jiaming Han
Kaixiong Gong
Yiyuan Zhang
Yuan Liu
Kaipeng Zhang
Dahua Lin
Yu Qiao
Shiyang Feng
Xiangyu Yue
MLLM
484
188
0
10 Jan 2025
Classifier-Guided Captioning Across Modalities
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
Ariel Shaulov
Tal Shaharabany
E. Shaar
Gal Chechik
Lior Wolf
185
0
0
03 Jan 2025
Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension
AAAI Conference on Artificial Intelligence (AAAI), 2025
Yaxian Wang
Henghui Ding
Shuting He
Xudong Jiang
Bifan Wei
Jun Liu
ObjD
205
7
0
03 Jan 2025
Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning
European Conference on Computer Vision (ECCV), 2024
Jianjie Luo
Jingwen Chen
Yehao Li
Yingwei Pan
Jianlin Feng
Hongyang Chao
Ting Yao
DiffM, VLM
225
1
0
03 Jan 2025
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
International Conference on Learning Representations (ICLR), 2024
Ziyan Jiang
Rui Meng
Xinyi Yang
Semih Yavuz
Yingbo Zhou
Lei Ma
MLLM, VLM
452
89
0
03 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Neural Information Processing Systems (NeurIPS), 2024
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM, VLM, LRM
647
113
0
03 Jan 2025
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Neural Information Processing Systems (NeurIPS), 2024
Hao Fei
Shengqiong Wu
Hao Zhang
Tat-Seng Chua
Shuicheng Yan
407
70
0
31 Dec 2024
ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers
Chao Fan
Qipei Mei
Xiaonan Wang
Xinming Li
134
4
0
31 Dec 2024
Towards Visual Grounding: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Linhui Xiao
Xiaoshan Yang
X. Lan
Yaowei Wang
Changsheng Xu
ObjD
739
26
0
28 Dec 2024
To Predict or Not To Predict? Proportionally Masked Autoencoders for Tabular Data Imputation
Jungkyu Kim
Kibok Lee
Taeyoung Park
296
3
0
26 Dec 2024
GME: Improving Universal Multimodal Retrieval by Multimodal LLMs
Xin Zhang
Yanzhao Zhang
Wen Xie
Mingxin Li
Ziqi Dai
Dingkun Long
Pengjun Xie
Meishan Zhang
Wenjie Li
Hao Fei
374
65
0
22 Dec 2024
DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
Computer Vision and Pattern Recognition (CVPR), 2024
Cijo Jose
Théo Moutakanni
Dahyun Kang
Federico Baldassarre
Timothée Darcet
...
Maxime Oquab
Oriane Siméoni
Huy V. Vo
Patrick Labatut
Piotr Bojanowski
CLIP, VLM
268
30
0
20 Dec 2024
Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data
Zhiqiang Tang
Zihan Zhong
Tong He
Gerald Friedland
323
4
0
19 Dec 2024
I0T: Embedding Standardization Method Towards Zero Modality Gap
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Na Min An
Eunki Kim
James Thorne
Hyunjung Shim
VLM
297
2
0
18 Dec 2024
LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer
Yipeng Zhang
Yi Liu
Zonghao Guo
Yidan Zhang
Xuesong Yang
...
Xingtai Lv
Zhiyuan Liu
Tat-Seng Chua
Maosong Sun
MLLM, VLM
291
3
0
18 Dec 2024
M$^3$-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation
Computer Vision and Pattern Recognition (CVPR), 2024
Zixuan Chen
Jiaxin Li
Liming Tan
Yejie Guo
Junxuan Liang
Cewu Lu
Yongqian Li
VOS
321
0
0
18 Dec 2024
FLAIR: VLM with Fine-grained Language-informed Image Representations
Computer Vision and Pattern Recognition (CVPR), 2024
Rui Xiao
Sanghwan Kim
Mariana-Iuliana Georgescu
Zeynep Akata
Stephan Alaniz
VLM, CLIP
268
17
0
04 Dec 2024
DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding
Hao Wu
Zhihang Zhong
Xiao Sun
DiffM
198
1
0
02 Dec 2024
CIA: Controllable Image Augmentation Framework Based on Stable Diffusion
Conference on Multimedia Information Processing and Retrieval (MIPR), 2024
Mohamed Benkedadra
Dany Rimez
Tiffanie Godelaine
Natarajan Chidambaram
Hamed Razavi Khosroshahi
Horacio Tellez
Matei Mancas
Benoît Macq
Sidi Ahmed Mahmoudi
DiffM
206
2
0
25 Nov 2024
IterIS: Iterative Inference-Solving Alignment for LoRA Merging
Computer Vision and Pattern Recognition (CVPR), 2024
Hongxu Chen
Runshi Li
Bowei Zhu
Zhen Wang
Long Chen
MoMe
344
4
0
21 Nov 2024
AI-generated Image Detection: Passive or Watermark?
Moyang Guo
Yuepeng Hu
Zhengyuan Jiang
Zeyu Li
Amir Sadovnik
Arka Daw
Neil Zhenqiang Gong
392
2
0
20 Nov 2024
Joint Vision-Language Social Bias Removal for CLIP
Computer Vision and Pattern Recognition (CVPR), 2024
Haoyu Zhang
Yangyang Guo
Mohan S. Kankanhalli
VLM
375
9
0
19 Nov 2024
SoK: The Security-Safety Continuum of Multimodal Foundation Models through Information Flow and Global Game-Theoretic Analysis of Asymmetric Threats
Ruoxi Sun
Jiamin Chang
Hammond Pearce
Chaowei Xiao
B. Li
Qi Wu
Surya Nepal
Minhui Xue
577
0
0
17 Nov 2024
Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations
Jianfeng Chi
Ujjwal Karn
Hongyuan Zhan
Eric Michael Smith
Javier Rando
Yiming Zhang
Kate Plawiak
Zacharie Delpierre Coudert
Kartikeya Upasani
Mahesh Pasupuleti
MLLM, 3DH
222
75
0
15 Nov 2024
Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models
Wei Wang
Hao Sun
Qi Xu
Linfeng Li
Yiqing Cai
Botian Jiang
Hang Song
Xingcan Hu
Pengyu Wang
Li Xiao
144
7
0
14 Nov 2024
AD-DINO: Attention-Dynamic DINO for Distance-Aware Embodied Reference Understanding
Hao Guo
Wei Fan
Baichun Wei
Jianfei Zhu
Jin Tian
Chunzhi Yi
Feng Jiang
201
0
0
13 Nov 2024
No Culture Left Behind: ArtELingo-28, a Benchmark of WikiArt with Captions in 28 Languages
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Youssef Mohamed
Runjia Li
Ibrahim Said Ahmad
Kilichbek Haydarov
Juil Sock
Kenneth Church
Mohamed Elhoseiny
VLM
155
15
0
06 Nov 2024
HumanVLM: Foundation for Human-Scene Vision-Language Model
Information Fusion (Inf. Fusion), 2024
Dawei Dai
Xu Long
Li Yutang
Zhang YuanHui
Shuyin Xia
VLM, MLLM
275
6
0
05 Nov 2024
Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Yang Liu
Sensen Gao
Qing Guo
Ke Ma
Yihao Huang
Simeng Qin
Yang Liu
Ivor Tsang
Xiaochun Cao
AAML
167
7
0
04 Nov 2024
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives
Neural Information Processing Systems (NeurIPS), 2024
Maitreya Patel
Abhiram Kusumba
Sheng Cheng
Changhoon Kim
Tejas Gokhale
Chitta Baral
Yezhou Yang
CoGe, CLIP
251
34
0
04 Nov 2024
ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
Anurag Bagchi
Zhipeng Bao
Yu-Xiong Wang
P. Tokmakov
Martial Hebert
VOS
215
2
0
30 Oct 2024
Preserving Pre-trained Representation Space: On Effectiveness of Prefix-tuning for Large Multi-modal Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Donghoon Kim
Gusang Lee
Kyuhong Shim
B. Shim
242
5
0
29 Oct 2024
ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval
Pattern Recognition (Pattern Recogn.), 2024
Zijia Zhao
Longteng Guo
Tongtian Yue
Erdong Hu
Shuai Shao
Zehuan Yuan
Hua Huang
Qingbin Liu
133
4
0
24 Oct 2024
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
Zhangwei Gao
Zhe Chen
Erfei Cui
Yiming Ren
Weiyun Wang
...
Lewei Lu
Tong Lu
Yu Qiao
Jifeng Dai
Wenhai Wang
VLM
319
82
0
21 Oct 2024
Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models
Yufei Zhan
Hongyin Zhao
Yousong Zhu
Fan Yang
Ming Tang
Jinqiao Wang
MLLM
219
3
0
21 Oct 2024
Test-time Adaptation for Cross-modal Retrieval with Query Shift
Haobin Li
Peng Hu
Qianjun Zhang
Xi Peng
Xiting Liu
Mouxing Yang
TTA
244
8
0
21 Oct 2024