1505.04870
Cited By
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
19 May 2015
Bryan A. Plummer
Liwei Wang
Christopher M. Cervantes
Juan C. Caicedo
Anjali Narayan-Chen
Svetlana Lazebnik
Papers citing
"Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models"
50 / 1,323 papers shown
Title
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
Xi Lin
Akshat Shrivastava
Liang Luo
Srinivasan Iyer
Mike Lewis
Gargi Ghosh
Luke Zettlemoyer
Armen Aghajanyan
MoE
243
50
0
31 Jul 2024
FiCo-ITR: bridging fine-grained and coarse-grained image-text retrieval for comparative performance analysis
Mikel Williams-Lekuona
Georgina Cosma
206
0
0
29 Jul 2024
Start from Video-Music Retrieval: An Inter-Intra Modal Loss for Cross Modal Retrieval
Zeyu Chen
Pengfei Zhang
Kai Ye
Wei Dong
Xin Feng
Yana Zhang
195
1
0
28 Jul 2024
Unified Lexical Representation for Interpretable Visual-Language Alignment
Yifan Li
Yikai Wang
Yanwei Fu
Dongyu Ru
Zheng Zhang
Tong He
VLM
189
7
0
25 Jul 2024
Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
Ziyuan Huang
Kaixiang Ji
Biao Gong
Zhiwu Qing
Qinglong Zhang
Kecheng Zheng
Jian Wang
Jingdong Chen
Ming Yang
LRM
194
5
0
22 Jul 2024
Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection
Kwanyong Park
Kuniaki Saito
Donghyun Kim
VLM
CoGe
213
5
0
21 Jul 2024
Sim-CLIP: Unsupervised Siamese Adversarial Fine-Tuning for Robust and Semantically-Rich Vision-Language Models
Md Zarif Hossain
Ahmed Imteaj
VLM
AAML
173
9
0
20 Jul 2024
Learning Visual Grounding from Generative Vision and Language Model
Shijie Wang
Dahun Kim
A. Taalimi
Chen Sun
Weicheng Kuo
ObjD
257
18
0
18 Jul 2024
X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs
S. Swetha
Jinyu Yang
T. Neiman
Mamshad Nayeem Rizve
Son Tran
Benjamin Z. Yao
Trishul Chilimbi
Mubarak Shah
252
9
0
18 Jul 2024
Nearest Neighbor Future Captioning: Generating Descriptions for Possible Collisions in Object Placement Tasks
Takumi Komatsu
Motonari Kambara
Shumpei Hatanaka
Haruka Matsuo
Tsubasa Hirakawa
Takayoshi Yamashita
H. Fujiyoshi
Komei Sugiura
199
2
0
18 Jul 2024
OVGNet: A Unified Visual-Linguistic Framework for Open-Vocabulary Robotic Grasping
Meng Li
Qi Zhao
Shuchang Lyu
Chunlei Wang
Yujing Ma
Guangliang Cheng
Chenguang Yang
282
10
0
18 Jul 2024
Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning
Mustafa Dogan
İlker Kesen
Iacer Calixto
Aykut Erdem
Erkut Erdem
LRM
248
2
0
17 Jul 2024
Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval
Naoya Sogi
Takashi Shibata
Makoto Terao
VLM
242
4
0
17 Jul 2024
Controllable Contextualized Image Captioning: Directing the Visual Narrative through User-Defined Highlights
Shunqi Mao
Chaoyi Zhang
Hang Su
Hwanjun Song
Igor Shalyminov
Weidong Cai
291
4
0
16 Jul 2024
Towards Adversarially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques
Rishika Bhagwatkar
Shravan Nayak
Reza Bayat
Alexis Roger
Daniel Z Kaplan
P. Bashivan
Irina Rish
AAML
VLM
262
2
0
15 Jul 2024
OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer
Yu Wang
Xiangbo Su
Qiang Chen
Xinyu Zhang
Teng Xi
Kun Yao
Errui Ding
Qiang Chen
Jingdong Wang
ObjD
VLM
92
2
0
15 Jul 2024
Position: Measure Dataset Diversity, Don't Just Claim It
Dora Zhao
Jerone T. A. Andrews
Orestis Papakyriakopoulos
Alice Xiang
267
29
0
11 Jul 2024
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model
Yatai Ji
Shilong Zhang
Jie Wu
Peize Sun
Weifeng Chen
Xuefeng Xiao
Sidi Yang
Yanting Yang
Ping Luo
VLM
185
6
0
10 Jul 2024
How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?
Yuxin Chen
Zongyang Ma
Ziqi Zhang
Chen Ma
Chunfeng Yuan
Bing Li
Junfu Pu
Ying Shan
Xiaojuan Qi
Weiming Hu
146
3
0
10 Jul 2024
Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation
Seonghoon Yu
Paul Hongsuck Seo
Jeany Son
DiffM
396
12
0
10 Jul 2024
A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends
Daizong Liu
Mingyu Yang
Xiaoye Qu
Pan Zhou
Yu Cheng
Wei Hu
ELM
AAML
324
69
0
10 Jul 2024
LEMoN: Label Error Detection using Multimodal Neighbors
Haoran Zhang
Aparna Balagopalan
Nassim Oufattole
Hyewon Jeong
Yan Wu
Jiacheng Zhu
Elisa Kreiss
379
2
0
10 Jul 2024
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Yu-Guan Hsieh
Cheng-Yu Hsieh
Shih-Ying Yeh
Louis Béthune
Hadi Pour Ansari
Pavan Kumar Anasosalu Vasu
Chun-Liang Li
Ranjay Krishna
Oncel Tuzel
Marco Cuturi
345
7
0
09 Jul 2024
A Single Transformer for Scalable Vision-Language Modeling
Yangyi Chen
Xingyao Wang
Yuan Yao
Heng Ji
LRM
269
29
0
08 Jul 2024
OneDiff: A Generalist Model for Image Difference Captioning
Erdong Hu
Longteng Guo
Tongtian Yue
Zijia Zhao
Xuange Gao
Jing Liu
VLM
484
4
0
08 Jul 2024
MobileFlow: A Multimodal LLM For Mobile GUI Agent
Songqin Nong
Jiali Zhu
Rui Wu
Jiongchao Jin
Shuo Shan
Xiutian Huang
Wenhao Xu
160
23
0
05 Jul 2024
ACTRESS: Active Retraining for Semi-supervised Visual Grounding
Weitai Kang
Mengxue Qu
Yunchao Wei
Yan Yan
302
8
0
03 Jul 2024
Visual Grounding with Attention-Driven Constraint Balancing
Weitai Kang
Luowei Zhou
Junyi Wu
Changchang Sun
Yan Yan
234
9
0
03 Jul 2024
SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding
Weitai Kang
Gaowen Liu
Mubarak Shah
Yan Yan
ObjD
384
17
0
03 Jul 2024
Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective
Zhaotian Weng
Zijun Gao
Jerone Andrews
Jieyu Zhao
267
9
0
03 Jul 2024
MMedAgent: Learning to Use Medical Tools with Multi-modal Agent
Binxu Li
Tiankai Yan
Yuanting Pan
Zhe Xu
Jie Luo
Ruiyang Ji
Shilong Liu
Haoyu Dong
Zihao Lin
Yixin Wang
LM&MA
197
69
0
02 Jul 2024
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Sanjoy Chowdhury
Sayan Nag
Subhrajyoti Dasgupta
Jun Chen
Mohamed Elhoseiny
Ruohan Gao
Dinesh Manocha
VLM
MLLM
375
22
0
01 Jul 2024
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
Runqi Qiao
Qiuna Tan
Guanting Dong
Minhui Wu
Chong Sun
...
Yida Xu
Muxi Diao
Zhimin Bao
Chen Li
Honggang Zhang
VLM
LRM
268
157
0
01 Jul 2024
Tarsier: Recipes for Training and Evaluating Large Video Description Models
Jiawei Wang
Liping Yuan
Yuchen Zhang
283
112
0
30 Jun 2024
From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models
Mehar Bhatia
Sahithya Ravi
Aditya Chinchure
EunJeong Hwang
Vered Shwartz
VLM
310
9
0
28 Jun 2024
Analyzing Quality, Bias, and Performance in Text-to-Image Generative Models
Nila Masrourisaadat
Nazanin Sedaghatkish
Fatemeh Sarshartehrani
Edward A. Fox
287
13
0
28 Jun 2024
Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language
Yicheng Chen
Xiangtai Li
Yining Li
Yanhong Zeng
Jianzong Wu
Xiangyu Zhao
Kai Chen
VLM
DiffM
399
3
0
28 Jun 2024
A look under the hood of the Interactive Deep Learning Enterprise (No-IDLE)
Daniel Sonntag
Michael Barz
Thiago S. Gouvêa
VLM
244
6
0
27 Jun 2024
ScanFormer: Referring Expression Comprehension by Iteratively Scanning
Wei Su
Peihan Miao
Huanzhang Dou
Xi Li
ObjD
243
15
0
26 Jun 2024
Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP
Sedigheh Eslami
Gerard de Melo
VLM
288
12
0
25 Jun 2024
Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models
Jierun Chen
Fangyun Wei
Jinjing Zhao
Sizhe Song
Bohuai Wu
Zhuoxuan Peng
S.-H. Gary Chan
Hongyang R. Zhang
233
31
0
24 Jun 2024
Review of Zero-Shot and Few-Shot AI Algorithms in The Medical Domain
Maged Badawi
Mohammedyahia Abushanab
Sheethal Bhat
Andreas Maier
VLM
236
4
0
23 Jun 2024
Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models?
Gregor Geigle
Radu Timofte
Goran Glavaš
246
2
0
20 Jun 2024
Revealing Vision-Language Integration in the Brain with Multimodal Networks
Vighnesh Subramaniam
C. Conwell
Christopher Wang
Gabriel Kreiman
Boris Katz
Ignacio Cases
Andrei Barbu
220
18
0
20 Jun 2024
VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model
Jie Zhang
Sibo Wang
Xiangkui Cao
Zheng Yuan
Shiguang Shan
Xilin Chen
Wen Gao
VLM
355
18
0
20 Jun 2024
Composing Object Relations and Attributes for Image-Text Matching
Khoi Pham
Chuong Huynh
Ser-Nam Lim
Abhinav Shrivastava
CoGe
229
18
0
17 Jun 2024
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
Yujie Lu
Dongfu Jiang
Wenhu Chen
William Yang Wang
Yejin Choi
Bill Yuchen Lin
VLM
428
55
0
16 Jun 2024
First Multi-Dimensional Evaluation of Flowchart Comprehension for Multimodal Large Language Models
Enming Zhang
Ruobing Yao
Huanyong Liu
Junhui Yu
Jiale Wang
ELM
LRM
230
3
0
14 Jun 2024
Explore the Limits of Omni-modal Pretraining at Scale
Yiyuan Zhang
Handong Li
Jing Liu
Xiangyu Yue
VLM
LRM
219
1
0
13 Jun 2024
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations
Ruiyuan Lyu
Tai Wang
Jingli Lin
Shuai Yang
Xiaohan Mao
...
Runsen Xu
Haifeng Huang
Chenming Zhu
Dahua Lin
Jiangmiao Pang
3DV
325
33
0
13 Jun 2024