1504.00325
Cited By
Microsoft COCO Captions: Data Collection and Evaluation Server
1 April 2015
Xinlei Chen
Hao Fang
Nayeon Lee
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick
Papers citing
"Microsoft COCO Captions: Data Collection and Evaluation Server"
50 / 1,519 papers shown
Toward Interactive Regional Understanding in Vision-Large Language Models
Jungbeom Lee
Sanghyuk Chun
Sangdoo Yun
VLM
305
4
0
27 Mar 2024
Can 3D Vision-Language Models Truly Understand Natural Language?
Weipeng Deng
Jihan Yang
Runyu Ding
Jiahui Liu
Yijiang Li
Xiaojuan Qi
Edith C.H. Ngai
433
9
0
21 Mar 2024
Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models
Pablo Marcos-Manchón
Roberto Alcover-Couso
Juan C. Sanmiguel
Jose M. Martínez
VLM
294
29
0
21 Mar 2024
What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models
Junho Kim
Yeonju Kim
Yonghyun Ro
LRM
MLLM
211
9
0
20 Mar 2024
As Firm As Their Foundations: Can open-sourced foundation models be used to create adversarial examples for downstream tasks?
Anjun Hu
Jindong Gu
Francesco Pinto
Konstantinos Kamnitsas
Juil Sock
AAML
SILM
261
9
0
19 Mar 2024
A Survey on Quality Metrics for Text-to-Image Generation
IEEE Transactions on Visualization and Computer Graphics (TVCG), 2024
Sebastian Hartwig
Dominik Engel
Leon Sick
H. Kniesel
Tristan Payer
Poonam Poonam
Michael Glockler
Alex Bauerle
Timo Ropinski
EGVM
300
0
0
18 Mar 2024
SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant
Guohao Sun
Can Qin
Jiamian Wang
Zeyuan Chen
Ran Xu
Zhiqiang Tao
MLLM
VLM
LRM
288
22
0
17 Mar 2024
LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for Remote Sensing Image-Text Retrival
Yuanxin Zhao
Mi Zhang
Bingnan Yang
Zhan Zhang
Jiaju Kang
Jianya Gong
192
5
0
16 Mar 2024
Refining Knowledge Transfer on Audio-Image Temporal Agreement for Audio-Text Cross Retrieval
Shunsuke Tsubaki
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
Keisuke Imoto
237
1
0
16 Mar 2024
Generative Region-Language Pretraining for Open-Ended Object Detection
Computer Vision and Pattern Recognition (CVPR), 2024
Chuang Lin
Yi Jiang
Zhuang Li
Zehuan Yuan
Jianfei Cai
ObjD
VLM
224
27
0
15 Mar 2024
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Brandon McKinzie
Zhe Gan
J. Fauconnier
Sam Dodge
Bowen Zhang
...
Zirui Wang
Ruoming Pang
Peter Grasch
Alexander Toshev
Yinfei Yang
MLLM
524
246
0
14 Mar 2024
Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing
European Conference on Computer Vision (ECCV), 2024
Wonjun Kang
Kevin Galim
Hyung Il Koo
DiffM
239
9
0
14 Mar 2024
GiT: Towards Generalist Vision Transformer through Universal Language Interface
European Conference on Computer Vision (ECCV), 2024
Haiyang Wang
Hao Tang
Li Jiang
Shaoshuai Shi
Muhammad Ferjad Naeem
Jiaming Song
Bernt Schiele
Liwei Wang
VLM
280
22
0
14 Mar 2024
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan
Yousong Zhu
Hongyin Zhao
Fan Yang
Jinqiao Wang
ObjD
294
26
0
14 Mar 2024
DAM: Dynamic Adapter Merging for Continual Video QA Learning
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Feng Cheng
Ziyang Wang
Yi-Lin Sung
Yan-Bo Lin
Mohit Bansal
Gedas Bertasius
CLL
MoMe
367
18
0
13 Mar 2024
An Empirical Study of Parameter Efficient Fine-tuning on Vision-Language Pre-train Model
IEEE International Conference on Multimedia and Expo (ICME), 2024
Yuxin Tian
Mouxing Yang
Yunfan Li
Dayiheng Liu
Xingzhang Ren
Xiaocui Peng
Jiancheng Lv
VLM
161
1
0
13 Mar 2024
Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
Computer Vision and Pattern Recognition (CVPR), 2024
Lei Zhu
Fangyun Wei
Yanye Lu
MLLM
VLM
222
30
0
12 Mar 2024
Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings
Sahand Sharifzadeh
Christos Kaplanis
Shreya Pathak
D. Kumaran
Anastasija Ilić
Jovana Mitrović
Charles Blundell
Andrea Banino
VLM
238
17
0
12 Mar 2024
Transformer based Multitask Learning for Image Captioning and Object Detection
Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2024
Debolena Basak
P. K. Srijith
M. Desarkar
190
3
0
10 Mar 2024
CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?
International Conference on Learning Representations (ICLR), 2024
Ibrahim Alabdulmohsin
Xiao Wang
Andreas Steiner
Priya Goyal
Alexander D'Amour
Xiao-Qi Zhai
213
30
0
07 Mar 2024
Popeye: A Unified Visual-Language Model for Multi-Source Ship Detection from Remote Sensing Imagery
Wei Zhang
Miaoxin Cai
Tong Zhang
Guoqiang Lei
Zhuang Yin
Xuerui Mao
211
16
0
06 Mar 2024
Neural Image Compression with Text-guided Encoding for both Pixel-level and Perceptual Fidelity
Hagyeong Lee
Minkyu Kim
Jun-Hyuk Kim
Seungeon Kim
Dokwan Oh
Jaeho Lee
DiffM
247
17
0
05 Mar 2024
When ControlNet Meets Inexplicit Masks: A Case Study of ControlNet on its Contour-following Ability
Wenjie Xuan
Yufei Xu
Shanshan Zhao
Chaoyue Wang
Juhua Liu
Bo Du
Dacheng Tao
220
10
0
01 Mar 2024
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
Zhekai Zhang
Tianle Cai
Jiaxin Cao
Qinsheng Zhang
Han Cai
Junjie Bai
Yangqing Jia
Ming-Yu Liu
Kai Li
Song Han
DiffM
417
99
0
29 Feb 2024
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Weiyun Wang
Yiming Ren
Hao Luo
Tiantong Li
Chenxiang Yan
...
Qingyun Li
Lewei Lu
Xizhou Zhu
Yu Qiao
Jifeng Dai
MLLM
318
86
0
29 Feb 2024
SynArtifact: Classifying and Alleviating Artifacts in Synthetic Images via Vision-Language Model
Bin Cao
Jianhao Yuan
Yexin Liu
Jian Li
Shuyang Sun
Jing Liu
Bo Zhao
DiffM
286
13
0
28 Feb 2024
Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction
Koki Maeda
Shuhei Kurita
Taiki Miyanishi
Naoaki Okazaki
225
6
0
28 Feb 2024
Acquiring Linguistic Knowledge from Multimodal Input
Theodor Amariucai
Alexander Scott Warstadt
CLL
291
4
0
27 Feb 2024
MCF-VC: Mitigate Catastrophic Forgetting in Class-Incremental Learning for Multimodal Video Captioning
Huiyu Xiong
Lanxiao Wang
Heqian Qiu
Taijin Zhao
Benliu Qiu
Hongliang Li
CLL
223
1
0
27 Feb 2024
Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning
Maurits J. R. Bleeker
Mariya Hendriksen
Andrew Yates
Maarten de Rijke
VLM
324
9
0
27 Feb 2024
ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks
Yang Liu
Xiaomin Yu
Gongyu Zhang
Christos Bergeles
Prokar Dasgupta
Alejandro Granados
Sebastien Ourselin
214
3
0
27 Feb 2024
Towards Open-ended Visual Quality Comparison
Haoning Wu
Hanwei Zhu
Zicheng Zhang
Erli Zhang
Chaofeng Chen
...
Qiong Yan
Xiaohong Liu
Guangtao Zhai
Shiqi Wang
Weisi Lin
AAML
245
91
0
26 Feb 2024
CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models
Santiago Castro
Amir Ziai
Avneesh Saluja
Zhuoning Yuan
Amélie Reymond
MLLM
CoGe
VLM
232
8
0
22 Feb 2024
Vision-Language Navigation with Embodied Intelligence: A Survey
Peng Gao
Peng Wang
Feng Gao
Haiwei Yang
Ruyue Yuan
LM&Ro
357
8
0
22 Feb 2024
VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
Jiawei Liang
Yaning Tan
Man Luo
Aishan Liu
Dongchen Han
Ee-Chien Chang
Xiaochun Cao
278
72
0
21 Feb 2024
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models
Ziyue Wang
Chi Chen
Zihao Wan
Zhaolu Kang
Qidong Yan
...
Xiaoyue Mi
Peng Li
Ning Ma
Maosong Sun
Yang Liu
314
11
0
21 Feb 2024
A Multimodal In-Context Tuning Approach for E-Commerce Product Description Generation
Yunxin Li
Baotian Hu
Tong Lu
Lin Ma
Yuxin Ding
Min Zhang
245
4
0
21 Feb 2024
CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples
Jianrui Zhang
Mu Cai
Tengyang Xie
Yong Jae Lee
LRM
390
33
0
20 Feb 2024
ConVQG: Contrastive Visual Question Generation with Multimodal Guidance
Li Mi
Syrielle Montariol
J. Castillo-Navarro
Xianjie Dai
Antoine Bosselut
D. Tuia
177
7
0
20 Feb 2024
Language-guided Image Reflection Separation
Haofeng Zhong
Yuchen Hong
Shuchen Weng
Jinxiu Liang
Boxin Shi
270
21
0
19 Feb 2024
Interpretable Embedding for Ad-hoc Video Search
Jiaxin Wu
Chong-Wah Ngo
177
32
0
19 Feb 2024
Cobra Effect in Reference-Free Image Captioning Metrics
Zheng Ma
Changxin Wang
Yawen Ouyang
Fei Zhao
Jianbing Zhang
Shujian Huang
Jiajun Chen
243
4
0
18 Feb 2024
Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability
Yejun Yoon
Seunghyun Yoon
Kunwoo Park
321
1
0
17 Feb 2024
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter
Junfei Xiao
Zheng Xu
Yaoyao Liu
Shen Yan
Boyu Wang
238
6
0
16 Feb 2024
Recovering the Pre-Fine-Tuning Weights of Generative Models
Eliahu Horwitz
Jonathan Kahana
Yedid Hoshen
260
12
0
15 Feb 2024
Social Reward: Evaluating and Enhancing Generative AI through Million-User Feedback from an Online Creative Community
Arman Isajanyan
Artur Shatveryan
David Kocharyan
Zinan Lin
Humphrey Shi
EGVM
232
8
0
15 Feb 2024
DoRA: Weight-Decomposed Low-Rank Adaptation
Shih-yang Liu
Chien-Yi Wang
Hongxu Yin
Pavlo Molchanov
Yu-Chiang Frank Wang
Kwang-Ting Cheng
Min-Hung Chen
774
676
0
14 Feb 2024
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
Yutao Hu
Tian-Xin Li
Quanfeng Lu
Wenqi Shao
Junjun He
Yu Qiao
Ping Luo
ELM
LM&MA
331
135
0
14 Feb 2024
Visually Dehallucinative Instruction Generation
Sungguk Cha
Jusung Lee
Younghyun Lee
Cheoljong Yang
MLLM
92
8
0
13 Feb 2024
A Benchmark for Multi-modal Foundation Models on Low-level Vision: from Single Images to Pairs
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Zicheng Zhang
Haoning Wu
Erli Zhang
Guangtao Zhai
Weisi Lin
VLM
166
8
0
11 Feb 2024