
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik
arXiv: 1505.04870 (abs / PDF / HTML)

Papers citing "Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"

50 / 1,325 papers shown
Fact: Teaching MLLMs with Faithful, Concise and Transferable Rationales
Minghe Gao
Shuang Chen
Liang Pang
Xingtai Lv
Jisheng Dang
Wenqiao Zhang
Juncheng Li
Siliang Tang
Yueting Zhuang
Tat-Seng Chua
LRM
161
10
0
17 Apr 2024
HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision
Siddhant Bansal
Michael Wray
Dima Damen
219
10
0
15 Apr 2024
RankCLIP: Ranking-Consistent Language-Image Pretraining
Yiming Zhang
Zhuokai Zhao
Zhaorun Chen
Zhili Feng
Zenghui Ding
Yining Sun
SSL, VLM
405
9
0
15 Apr 2024
Generalized Contrastive Learning for Multi-Modal Retrieval and Ranking
Tianyu Zhu
M. Jung
Jesse Clark
430
4
0
12 Apr 2024
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Haotian Zhang
Haoxuan You
Philipp Dufter
Bowen Zhang
Chen Chen
...
Tsu-Jui Fu
William Y. Wang
Shih-Fu Chang
Zhe Gan
Yinfei Yang
ObjD, MLLM
285
84
0
11 Apr 2024
How is Visual Attention Influenced by Text Guidance? Database and Model
Yinan Sun
Xiongkuo Min
Huiyu Duan
Guangtao Zhai
260
7
0
11 Apr 2024
To Cool or not to Cool? Temperature Network Meets Large Foundation Models via DRO
Zi-Hao Qiu
Siqi Guo
Mao Xu
Tuo Zhao
Lijun Zhang
Tianbao Yang
AI4TS, AI4CE
288
10
0
06 Apr 2024
Vision Transformers in Domain Adaptation and Generalization: A Study of Robustness
Shadi Alijani
Jamil Fayyad
Homayoun Najjaran
OOD
314
1
0
05 Apr 2024
Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)
Michael Stephen Saxon
Fatima Jahara
Mahsa Khoshnoodi
Yujie Lu
Aditya Sharma
William Y. Wang
EGVM
304
12
0
05 Apr 2024
Would Deep Generative Models Amplify Bias in Future Models?
Computer Vision and Pattern Recognition (CVPR), 2024
Tianwei Chen
Yusuke Hirota
Mayu Otani
Noa Garcia
Yuta Nakashima
216
23
0
04 Apr 2024
Text-driven Affordance Learning from Egocentric Vision
Tomoya Yoshida
Shuhei Kurita
Taichi Nishimura
Shinsuke Mori
286
6
0
03 Apr 2024
Rethinking Pruning for Vision-Language Models: Strategies for Effective Sparsity and Performance Restoration
Shwai He
Ang Li
Tianlong Chen
VLM
285
3
0
03 Apr 2024
CosmicMan: A Text-to-Image Foundation Model for Humans
Shikai Li
Jianglin Fu
Kaiyuan Liu
Wentao Wang
Kwan-Yee Lin
Wayne Wu
DiffM
284
34
0
01 Apr 2024
Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models
Jesse Atuhurra
Iqra Ali
Tatsuya Hiraoka
Hidetaka Kamigaito
Tomoya Iwakura
Taro Watanabe
244
1
0
29 Mar 2024
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Weifeng Lin
Xinyu Wei
Ruichuan An
Shiyang Feng
Bocheng Zou
Yulin Luo
Siyuan Huang
Shanghang Zhang
Jiaming Song
VLM
402
84
0
29 Mar 2024
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
Kai Zhang
Yi Luan
Hexiang Hu
Kenton Lee
Siyuan Qiao
Wenhu Chen
Yu-Chuan Su
Ming-Wei Chang
VLM, LRM
307
77
0
28 Mar 2024
J-CRe3: A Japanese Conversation Dataset for Real-world Reference Resolution
Nobuhiro Ueda
Hideko Habe
Yoko Matsui
Akishige Yuguchi
Seiya Kawano
Yasutomo Kawanishi
Sadao Kurohashi
Koichiro Yoshino
185
8
0
28 Mar 2024
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
Hao Shao
Shengju Qian
Han Xiao
Guanglu Song
Zhuofan Zong
Letian Wang
Yu Liu
Jiaming Song
VGen, LRM, MLLM
346
216
0
25 Mar 2024
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
European Conference on Computer Vision (ECCV), 2024
Yi Wang
Kunchang Li
Xinhao Li
Jiashuo Yu
Yinan He
...
Hongjie Zhang
Yifei Huang
Yu Qiao
Yali Wang
Limin Wang
262
104
0
22 Mar 2024
Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models
Qiong Wu
Yiyi Zhou
Weihao Ye
Xiaoshuai Sun
Rongrong Ji
MoE
188
2
0
22 Mar 2024
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery
Guan-Feng Wang
Long Bai
Wan Jun Nah
Jie Wang
Zhaoxi Zhang
Zhen Chen
Jinlin Wu
Mobarakol Islam
Hongbin Liu
Hongliang Ren
350
28
0
22 Mar 2024
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
Zheng Zhang
Yeyao Ma
Enming Zhang
Xiang Bai
VLM, MLLM
285
79
0
21 Mar 2024
SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models
Tongtian Yue
Jie Cheng
Longteng Guo
Xingyuan Dai
Zijia Zhao
Xingjian He
Gang Xiong
Yisheng Lv
Jing Liu
216
13
0
20 Mar 2024
WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar
Runwei Guan
Liye Jia
Fengyufan Yang
Shanliang Yao
Erick Purwanto
...
Eng Gee Lim
Jeremy S. Smith
Ka Lok Man
Xuming Hu
Yutao Yue
378
19
0
19 Mar 2024
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory
Sensen Gao
Yang Liu
Xuhong Ren
Ivor Tsang
Qing Guo
AAML
337
30
0
19 Mar 2024
A Survey on Quality Metrics for Text-to-Image Generation
IEEE Transactions on Visualization and Computer Graphics (TVCG), 2024
Sebastian Hartwig
Dominik Engel
Leon Sick
H. Kniesel
Tristan Payer
Poonam Poonam
Michael Glockler
Alex Bauerle
Timo Ropinski
EGVM
298
0
0
18 Mar 2024
Improving Adversarial Transferability of Vision-Language Pre-training Models through Collaborative Multimodal Interaction
Jiyuan Fu
Zhaoyu Chen
Kaixun Jiang
Haijing Guo
Jiafeng Wang
Shuyong Gao
Wenqiang Zhang
VLM, AAML
171
7
0
16 Mar 2024
Mitigating Dialogue Hallucination for Large Vision Language Models via Adversarial Instruction Tuning
Dongmin Park
Zhaofang Qian
Guangxing Han
Ser-Nam Lim
MLLM
261
1
0
15 Mar 2024
GiT: Towards Generalist Vision Transformer through Universal Language Interface
European Conference on Computer Vision (ECCV), 2024
Haiyang Wang
Hao Tang
Li Jiang
Shaoshuai Shi
Muhammad Ferjad Naeem
Jiaming Song
Bernt Schiele
Liwei Wang
VLM
280
22
0
14 Mar 2024
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan
Yousong Zhu
Hongyin Zhao
Fan Yang
Jinqiao Wang
ObjD
294
26
0
14 Mar 2024
MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
Computer Vision and Pattern Recognition (CVPR), 2024
Haokun Lin
Haoli Bai
Zhili Liu
Lu Hou
Muyi Sun
Linqi Song
Ying Wei
Zhenan Sun
CLIP, VLM
165
35
0
12 Mar 2024
Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings
Sahand Sharifzadeh
Christos Kaplanis
Shreya Pathak
D. Kumaran
Anastasija Ilić
Jovana Mitrović
Charles Blundell
Andrea Banino
VLM
238
17
0
12 Mar 2024
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
European Conference on Computer Vision (ECCV), 2024
Liang Chen
Haozhe Zhao
Tianyu Liu
Shuai Bai
Junyang Lin
Chang Zhou
Baobao Chang
MLLM, VLM
342
333
0
11 Mar 2024
VLM-PL: Advanced Pseudo Labeling Approach for Class Incremental Object Detection via Vision-Language Model
Junsu Kim
Yunhoe Ku
Jihyeon Kim
Junuk Cha
Seungryul Baek
ObjD, VLM
330
24
0
08 Mar 2024
Effectiveness Assessment of Recent Large Vision-Language Models
Yao Jiang
Xinyu Yan
Ge-Peng Ji
Keren Fu
Meijun Sun
Huan Xiong
Deng-Ping Fan
Fahad Shahbaz Khan
535
36
0
07 Mar 2024
MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding
Chun-Peng Chang
Shaoxiang Wang
A. Pagani
Didier Stricker
360
23
0
05 Mar 2024
Detecting Concrete Visual Tokens for Multimodal Machine Translation
Braeden Bowen
Vipin Vijayan
Scott Grigsby
Timothy Anderson
Jeremy Gwinnup
260
5
0
05 Mar 2024
Adding Multimodal Capabilities to a Text-only Translation Model
Vipin Vijayan
Braeden Bowen
Scott Grigsby
Timothy Anderson
Jeremy Gwinnup
LRM
276
10
0
05 Mar 2024
Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception
Jun-Yan He
Yifan Wang
Lijun Wang
Huchuan Lu
Jun-Yan He
Jinpeng Lan
Bin Luo
Xuansong Xie
MLLM, VLM
229
35
0
05 Mar 2024
Differentially Private Representation Learning via Image Captioning
Tom Sander
Yaodong Yu
Maziar Sanjabi
Alain Durmus
Yi-An Ma
Kamalika Chaudhuri
Chuan Guo
279
7
0
04 Mar 2024
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
David Wan
Jaemin Cho
Elias Stengel-Eskin
Mohit Bansal
VLM, ObjD
316
49
0
04 Mar 2024
Regeneration Based Training-free Attribution of Fake Images Generated by Text-to-Image Generative Models
Meiling Li
Zhenxing Qian
Xinpeng Zhang
295
2
0
03 Mar 2024
Can Transformers Capture Spatial Relations between Objects?
Chuan Wen
Dinesh Jayaraman
Yang Gao
ViT
178
8
0
01 Mar 2024
Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset
Ander Salaberria
Gorka Azkune
Oier López de Lacalle
A. Soroa
Eneko Agirre
Frank Keller
EGVM
199
3
0
01 Mar 2024
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Weiyun Wang
Yiming Ren
Hao Luo
Tiantong Li
Chenxiang Yan
...
Qingyun Li
Lewei Lu
Xizhou Zhu
Yu Qiao
Jifeng Dai
MLLM
318
85
0
29 Feb 2024
How to Understand "Support"? An Implicit-enhanced Causal Inference Approach for Weakly-supervised Phrase Grounding
Jiamin Luo
Jianing Zhao
Jingjing Wang
Guodong Zhou
234
0
0
29 Feb 2024
Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control
Thong Nguyen
Mariya Hendriksen
Andrew Yates
Maarten de Rijke
190
15
0
27 Feb 2024
Probing Multimodal Large Language Models for Global and Local Semantic Representations
Mingxu Tao
Quzhe Huang
Kun Xu
Liwei Chen
Yansong Feng
Dongyan Zhao
322
10
0
27 Feb 2024
Measuring Vision-Language STEM Skills of Neural Models
Jianhao Shen
Ye Yuan
Srbuhi Mirzoyan
Ming Zhang
Chenguang Wang
VLM
426
13
0
27 Feb 2024
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
Yichi Zhang
Ziqiao Ma
Xiaofeng Gao
Suhaila Shakiah
Qiaozi Gao
Joyce Chai
MLLM, VLM
376
74
0
26 Feb 2024