Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,325 papers shown
Technical Report on Text Dataset Distillation
Keith Ando Ogawa
Bruno Yamamoto
Lucas Lauton de Alcantara
Victor Zacarias
Edson Bollis
Lucas Pellicer
Rosimeire Pereira Costa
A. H. R. Costa
Artur Jordao
DD
284
0
0
03 Dec 2025
Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction
Rui Fonseca
Bruno Martins
Gil Rocha
VLM
115
0
0
03 Dec 2025
Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension
Juexi Shao
Siyou Li
Yujian Gan
Chris Madge
Vanja Karan
Massimo Poesio
148
0
0
02 Dec 2025
SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning
AAAI Conference on Artificial Intelligence (AAAI), 2025
Xu Zhang
Jin Yuan
Hanwang Zhang
Guojin Zhong
Yongsheng Zang
Jiacheng Lin
Zhiyong Li
DiffM VLM
136
1
0
01 Dec 2025
Hierarchical Semantic Alignment for Image Clustering
Xingyu Zhu
B. Zhu
Yunfan Li
Junfeng Fang
Shuo Wang
Kesen Zhao
Hanwang Zhang
67
0
0
30 Nov 2025
Diff-ICMH: Harmonizing Machine and Human Vision in Image Compression with Generative Prior
Ruoyu Feng
Y. Qi
Jinming Liu
Yixin Gao
Xin Li
Xin Jin
Zhibo Chen
DiffM
117
0
0
27 Nov 2025
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
Yunze Man
S. S. Wang
Guowen Zhang
Johan Bjorck
Zhiqi Li
Liang-Yan Gui
Jim Fan
Jan Kautz
Yu Wang
Zhiding Yu
126
0
0
25 Nov 2025
Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs
Z. J. Wang
Chang Che
Qi Wang
Hui Ma
Zenglin Shi
Cees G. M. Snoek
Meng Wang
CLL
209
0
0
25 Nov 2025
Online-PVLM: Advancing Personalized VLMs with Online Concept Learning
Huiyu Bai
Runze Wang
Zhuoyun Du
Yiyang Zhao
Fengji Zhang
H. Chen
Xiaoyong Zhu
Bo Zheng
Xuejiao Zhao
105
0
0
25 Nov 2025
VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning
Lingxiao Li
Y. Wang
Xinyan Gao
Chen Tang
Xiangyu Yue
Chenyu You
LRM
80
1
0
21 Nov 2025
VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference
Ziyan Liu
Y. Chen
Hongyi Cai
Tao Lin
Shuo Yang
Zheng Liu
Bo Zhao
VLM
323
0
0
20 Nov 2025
PairHuman: A High-Fidelity Photographic Dataset for Customized Dual-Person Generation
Information Fusion (Inf. Fusion), 2025
Ting Pan
Ye Wang
Peiguang Jing
Rui Ma
Zili Yi
Y. Liu
265
0
0
20 Nov 2025
Multimodal Continual Instruction Tuning with Dynamic Gradient Guidance
Songze Li
Mingyu Gao
Tonghua Su
Xu-Yao Zhang
Zhongjie Wang
CLL
332
0
0
19 Nov 2025
CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product
Kaiwen Xue
Chenglong Li
Zhonghong Ou
Guoxin Zhang
Kaoyan Lu
...
Xinyu Liu
Qunlin Chen
Weiwei Qin
Yiran Shen
Jiayi Cen
129
0
0
17 Nov 2025
Angular Gradient Sign Method: Uncovering Vulnerabilities in Hyperbolic Networks
Minsoo Jo
Dongyoon Yang
Taesup Kim
AAML
167
0
0
17 Nov 2025
Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
Yunxin Li
Xinyu Chen
Shenyuan Jiang
Haoyuan Shi
Zhenyu Liu
...
Zhenran Xu
Yicheng Ma
Meishan Zhang
Baotian Hu
Min Zhang
MLLM MoE OSLM VLM
625
1
0
16 Nov 2025
An Efficient Training Pipeline for Reasoning Graphical User Interface Agents
Georgios Pantazopoulos
Eda B. Özyiğit
LRM
355
0
0
11 Nov 2025
Surprisal reveals diversity gaps in image captioning and different scorers change the story
N. Ilinykh
Simon Dobnik
81
0
0
06 Nov 2025
Explore More, Learn Better: Parallel MLLM Embeddings under Mutual Information Minimization
Zhicheng Wang
Chen Ju
X. Chen
Shuai Xiao
Jinsong Lan
Xiaoyong Zhu
Ying Chen
Zhiguo Cao
216
0
0
03 Nov 2025
Enhancing Adversarial Transferability in Visual-Language Pre-training Models via Local Shuffle and Sample-based Attack
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Xin Liu
Aoyang Zhou
Aoyang Zhou
AAML
117
0
0
02 Nov 2025
From Evidence to Verdict: An Agent-Based Forensic Framework for AI-Generated Image Detection
Mengfei Liang
Y. Qu
Yukun Jiang
Michael Backes
Yang Zhang
193
0
0
31 Oct 2025
Masked Diffusion Captioning for Visual Feature Learning
Chao Feng
Zihao Wei
Andrew Owens
DiffM
266
0
0
30 Oct 2025
Distilling Multilingual Vision-Language Models: When Smaller Models Stay Multilingual
Sukrit Sriratanawilai
Jhayahgrit Thongwat
Romrawin Chumpu
Patomporn Payoungkhamdee
Sarana Nutanong
Peerat Limkonchotiwat
VLM
159
0
0
30 Oct 2025
Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation
Zhi-Kai Chen
Jun-Peng Jiang
Han-Jia Ye
De-Chuan Zhan
138
1
0
29 Oct 2025
DualCap: Enhancing Lightweight Image Captioning via Dual Retrieval with Similar Scenes Visual Prompts
Binbin Li
Guimiao Yang
Zisen Qi
Haiping Wang
Yu Ding
VLM
337
0
0
28 Oct 2025
T-REGS: Minimum Spanning Tree Regularization for Self-Supervised Learning
Julie Mordacq
David Loiseaux
Vicky Kalogeiton
S. Oudot
145
0
0
27 Oct 2025
Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context
Ge Zheng
Jiaye Qian
Jiajin Tang
Sibei Yang
100
6
0
23 Oct 2025
StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback
Jiho Park
Sieun Choi
Jaeyoon Seo
Jihie Kim
DiffM
126
0
0
23 Oct 2025
CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder
Yongmin Lee
Hye Won Chung
149
0
0
21 Oct 2025
ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder
Xiaoxing Hu
Kaicheng Yang
Ziyang Gong
Qi Ming
Zonghao Guo
Xiang An
Ziyong Feng
Junchi Yan
Xue Yang
CLIP VLM
234
1
0
21 Oct 2025
Foundation and Large-Scale AI Models in Neuroscience: A Comprehensive Review
Shihao Yang
Xiying Huang
Danilo Bernardo
J. Ding
Andrew Michael
Jingmei Yang
Patrick Kwan
Ashish Raj
Feng Liu
AI4CE
159
1
0
18 Oct 2025
Theoretical Refinement of CLIP by Utilizing Linear Structure of Optimal Similarity
Naoki Yoshida
Satoshi Hayakawa
Yuhta Takida
Toshimitsu Uesaka
Hiromi Wakaki
Yuki Mitsufuji
133
1
0
17 Oct 2025
Spatial Preference Rewarding for MLLMs Spatial Understanding
Han Qiu
Peng Gao
Lewei Lu
Xiaoqin Zhang
Ling Shao
Shijian Lu
LRM
147
0
0
16 Oct 2025
MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos
Gabriel Fiastre
Antoine Yang
Cordelia Schmid
VOS
454
1
0
16 Oct 2025
Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models
ACM Conference on Recommender Systems (RecSys), 2025
Yuki Yada
Sho Akiyama
Ryo Watanabe
Yuta Ueno
Yusuke Shido
Andre Rusli
VLM
84
1
0
15 Oct 2025
UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning
Tiancheng Gu
Kaicheng Yang
Kaichen Zhang
Xiang An
Ziyong Feng
Y. Zhang
Weidong Cai
Jiankang Deng
Lidong Bing
212
8
0
15 Oct 2025
What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging
Inha Kang
Youngsun Lim
S. Lee
Jiho Choi
Junsuk Choe
Hyunjung Shim
106
0
0
15 Oct 2025
Template-Based Text-to-Image Alignment for Language Accessibility: A Study on Visualizing Text Simplifications
Belkiss Souayed
Sarah Ebling
Yingqiang Gao
96
0
0
13 Oct 2025
Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey
Jinxuan Li
Chaolei Tan
Haoxuan Chen
Jianxin Ma
Jian-Fang Hu
Wei-Shi Zheng
Jianhuang Lai
VLM
151
1
0
12 Oct 2025
Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding
Weikai Huang
Jieyu Zhang
Taoyang Jia
Chenhao Zheng
Ziqi Gao
J. S. Park
Winson Han
Ranjay Krishna
226
0
0
10 Oct 2025
PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning
Daiki Yoshikawa
Takashi Matsubara
CoGe
200
0
0
10 Oct 2025
Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking
Mitchell Keren Taraday
Shahaf Wagner
Chaim Baskin
VLM
121
1
0
08 Oct 2025
Think Then Embed: Generative Context Improves Multimodal Embedding
Xuanming Cui
Jianpeng Cheng
Hong-you Chen
Satya Narayan Shukla
Abhijeet Awasthi
...
S. W. D. Lim
Qi Guo
Ser-Nam Lim
Aashu Singh
Xiangjun Fan
MLLM LRM
378
4
0
06 Oct 2025
Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models
Leander Girrbach
Stephan Alaniz
Genevieve Smith
Trevor Darrell
Zeynep Akata
205
3
0
04 Oct 2025
Referring Expression Comprehension for Small Objects
Kanoko Goto
Takumi Hirose
Mahiro Ukai
Shuhei Kurita
Nakamasa Inoue
ObjD
147
1
0
04 Oct 2025
One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
Lorenzo Bianchi
Giacomo Pacini
F. Carrara
Nicola Messina
Giuseppe Amato
Fabrizio Falchi
VLM
179
0
0
03 Oct 2025
CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning
Qihua Dong
Luis Figueroa
Handong Zhao
Kushal Kafle
Jason Kuen
Zhihong Ding
Scott D. Cohen
Y. Fu
ObjD LRM
200
0
0
03 Oct 2025
Multi-Objective Task-Aware Predictor for Image-Text Alignment
Eunki Kim
Na Min An
James Thorne
Hyunjung Shim
137
0
0
01 Oct 2025
ModernVBERT: Towards Smaller Visual Document Retrievers
Paul Teiletche
Quentin Macé
Max Conti
António Loison
Gautier Viaud
Pierre Colombo
Manuel Faysse
VLM
313
3
0
01 Oct 2025
Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding
Haotian Xue
Yunhao Ge
Y. Zeng
Zhaoshuo Li
Ming-Yu Liu
Yongxin Chen
JiaoJiao Fan
141
1
0
30 Sep 2025