Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1504.00325
Cited By
Microsoft COCO Captions: Data Collection and Evaluation Server
1 April 2015
Xinlei Chen
Hao Fang
Nayeon Lee
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Microsoft COCO Captions: Data Collection and Evaluation Server"
50 / 1,387 papers shown
Title
Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal Retrieval
Yabing Wang
Shuhui Wang
Hao Luo
Jianfeng Dong
F. Wang
Meng Han
Xun Wang
Meng Wang
24
9
0
11 Sep 2023
DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning
Zhengxiang Shi
Aldo Lipani
VLM
28
30
0
11 Sep 2023
ImageBind-LLM: Multi-modality Instruction Tuning
Jiaming Han
Renrui Zhang
Wenqi Shao
Peng Gao
Peng-Tao Xu
...
Yafei Wen
Xiaoxin Chen
Xiangyu Yue
Hongsheng Li
Yu Qiao
MLLM
49
116
0
07 Sep 2023
DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners
Clarence Lee
M Ganesh Kumar
Cheston Tan
28
3
0
07 Sep 2023
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
L. Yu
Bowen Shi
Ramakanth Pasunuru
Benjamin Muller
O. Yu. Golovneva
...
Yaniv Taigman
Maryam Fazel-Zarandi
Asli Celikyilmaz
Luke Zettlemoyer
Armen Aghajanyan
MLLM
38
135
0
05 Sep 2023
S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning
Wei Suo
Mengyang Sun
Weisong Liu
Yi-Meng Gao
Peifeng Wang
Yanning Zhang
Qi Wu
LRM
35
7
0
05 Sep 2023
NICE: CVPR 2023 Challenge on Zero-shot Image Captioning
Taehoon Kim
Pyunghwan Ahn
Sangyun Kim
Sihaeng Lee
Mark A Marsden
...
Yujin Wang
Yimu Wang
Tiancheng Gu
Xingchang Lv
Mingmao Sun
VLM
17
4
0
05 Sep 2023
Exploring Limits of Diffusion-Synthetic Training with Weakly Supervised Semantic Segmentation
Ryota Yoshihashi
Yuya Otsuka
Kenji Doi
Tomohiro Tanaka
Hirokatsu Kataoka
31
1
0
04 Sep 2023
Contrastive Feature Masking Open-Vocabulary Vision Transformer
Dahun Kim
A. Angelova
Weicheng Kuo
ObjD
VLM
23
27
0
02 Sep 2023
Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following
Ziyu Guo
Renrui Zhang
Xiangyang Zhu
Yiwen Tang
Xianzheng Ma
...
Ke Chen
Peng Gao
Xianzhi Li
Hongsheng Li
Pheng-Ann Heng
MLLM
35
125
0
01 Sep 2023
Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior
Ashmit Khandelwal
Aditya Agrawal
Aanisha Bhattacharyya
Yaman Kumar Singla
Somesh Singh
...
Ishita Dasgupta
Stefano Petrangeli
R. Shah
Changyou Chen
Balaji Krishnamurthy
24
8
0
01 Sep 2023
Towards Addressing the Misalignment of Object Proposal Evaluation for Vision-Language Tasks via Semantic Grounding
Joshua Forster Feinglass
Yezhou Yang
21
2
0
01 Sep 2023
TouchStone: Evaluating Vision-Language Models by Language Models
Shuai Bai
Shusheng Yang
Jinze Bai
Peng Wang
Xing Zhang
Junyang Lin
Xinggang Wang
Chang Zhou
Jingren Zhou
MLLM
37
44
0
31 Aug 2023
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection
Yifan Xu
Mengdan Zhang
Xiaoshan Yang
Changsheng Xu
ObjD
32
5
0
30 Aug 2023
Evaluation and Analysis of Hallucination in Large Vision-Language Models
Junyan Wang
Yi Zhou
Guohai Xu
Pengcheng Shi
Chenlin Zhao
...
Mingshi Yan
Ji Zhang
Jihua Zhu
Jitao Sang
Haoyu Tang
MLLM
21
66
0
29 Aug 2023
AI-Generated Content (AIGC) for Various Data Modalities: A Survey
Lin Geng Foo
Hossein Rahmani
Xiaozhong Liu
78
31
0
27 Aug 2023
MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning
Bang-ju Yang
Fenglin Liu
X. Wu
Yaowei Wang
Xu Sun
Yuexian Zou
VLM
CLIP
44
13
0
25 Aug 2023
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai
Shuai Bai
Shusheng Yang
Shijie Wang
Sinan Tan
Peng Wang
Junyang Lin
Chang Zhou
Jingren Zhou
MLLM
VLM
ObjD
50
796
0
24 Aug 2023
SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data
Ziyan Yang
Kushal Kafle
Zhe-nan Lin
Scott D. Cohen
Zhihong Ding
Vicente Ordonez
23
1
0
24 Aug 2023
HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks
Zichao Dong
Weikun Zhang
Xufeng Huang
Hang Ji
Xin Zhan
Junbo Chen
VLM
19
4
0
24 Aug 2023
RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D
Shuhei Kurita
Naoki Katsura
Eri Onami
EgoV
26
14
0
23 Aug 2023
VQA Therapy: Exploring Answer Differences by Visually Grounding Answers
Chongyan Chen
Samreen Anjum
Danna Gurari
23
10
0
21 Aug 2023
Explore and Tell: Embodied Visual Captioning in 3D Environments
Anwen Hu
Shizhe Chen
Liang Zhang
Qin Jin
LM&Ro
32
2
0
21 Aug 2023
Generic Attention-model Explainability by Weighted Relevance Accumulation
Yiming Huang
Ao Jia
Xiaodan Zhang
Jiawei Zhang
18
1
0
20 Aug 2023
VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control
Zi-Yuan Hu
Yanyang Li
M. Lyu
Liwei Wang
VLM
27
15
0
18 Aug 2023
RLIPv2: Fast Scaling of Relational Language-Image Pre-training
Hangjie Yuan
Shiwei Zhang
Xiang Wang
Samuel Albanie
Yining Pan
Tao Feng
Jianwen Jiang
Dong Ni
Yingya Zhang
Deli Zhao
VLM
21
38
0
18 Aug 2023
Diffusion Based Augmentation for Captioning and Retrieval in Cultural Heritage
Dario Cioni
Lorenzo Berlincioni
Federico Becattini
A. Bimbo
DiffM
24
9
0
14 Aug 2023
Foundation Model is Efficient Multimodal Multitask Model Selector
Fanqing Meng
Wenqi Shao
Zhanglin Peng
Chong Jiang
Kaipeng Zhang
Yu Qiao
Ping Luo
30
13
0
11 Aug 2023
Your Negative May not Be True Negative: Boosting Image-Text Matching with False Negative Elimination
Haoxuan Li
Yi Bin
Junrong Liao
Yang Yang
Heng Tao Shen
25
30
0
08 Aug 2023
Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval
Yi Bin
Haoxuan Li
Yahui Xu
Xing Xu
Yang Yang
Heng Tao Shen
VOS
24
18
0
08 Aug 2023
Tiny LVLM-eHub: Early Multimodal Experiments with Bard
Wenqi Shao
Yutao Hu
Peng Gao
Meng Lei
Kaipeng Zhang
...
Peng-Tao Xu
Siyuan Huang
Hongsheng Li
Yuning Qiao
Ping Luo
VLM
MLLM
32
2
0
07 Aug 2023
EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding
Jiazhou Zhou
Xueye Zheng
Yuanhuiyi Lyu
Lin Wang
VLM
24
16
0
06 Aug 2023
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu
Zhengyuan Yang
Linjie Li
Jianfeng Wang
Kevin Qinghong Lin
Zicheng Liu
Xinchao Wang
Lijuan Wang
MLLM
45
607
0
04 Aug 2023
Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP
Qihang Yu
Ju He
XueQing Deng
Xiaohui Shen
Liang-Chieh Chen
VLM
CLIP
36
136
0
04 Aug 2023
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
Weiyun Wang
Min Shi
Qingyun Li
Wen Wang
Zhenhang Huang
...
Zhiguo Cao
Yushi Chen
Tong Lu
Jifeng Dai
Yu Qiao
LRM
MLLM
48
84
0
03 Aug 2023
Reverse Stable Diffusion: What prompt was used to generate this image?
Florinel-Alin Croitoru
Vlad Hondru
Radu Tudor Ionescu
M. Shah
VLM
DiffM
39
6
0
02 Aug 2023
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Anas Awadalla
Irena Gao
Josh Gardner
Jack Hessel
Yusuf Hanafy
...
Simon Kornblith
Pang Wei Koh
Gabriel Ilharco
Mitchell Wortsman
Ludwig Schmidt
MLLM
33
403
0
02 Aug 2023
Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model
Ka Leong Cheng
Wenpo Song
Zheng Ma
Wenhao Zhu
Zi-Yue Zhu
Jianbing Zhang
CLIP
VLM
27
10
0
02 Aug 2023
Transferable Decoding with Visual Entities for Zero-Shot Image Captioning
Junjie Fei
Teng Wang
Jinrui Zhang
Zhenyu He
Chengjie Wang
Feng Zheng
VLM
28
34
0
31 Jul 2023
Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation
Zhiyuan Li
Dongnan Liu
Heng Wang
Chaoyi Zhang
Weidong (Tom) Cai
RALM
39
0
0
27 Jul 2023
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures
Kun Yuan
V. Srivastav
Tong Yu
Joël L. Lavanchy
Pietro Mascagni
Pietro Mascagni
N. Padoy
Nicolas Padoy
35
20
0
27 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming Yang
F. Khan
VLM
38
118
0
25 Jul 2023
Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework
Jingxuan Wei
Cheng Tan
Zhangyang Gao
Linzhuang Sun
Siyuan Li
Bihui Yu
R. Guo
Stan Z. Li
LRM
34
8
0
24 Jul 2023
Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering
Xinyue Hu
Lin Gu
Qi A. An
Mengliang Zhang
Liangchen Liu
Kazuma Kobayashi
Tatsuya Harada
Ronald M. Summers
Yingying Zhu
MedIm
20
22
0
22 Jul 2023
OxfordTVG-HIC: Can Machine Make Humorous Captions from Images?
Runjia Li
Shuyang Sun
Mohamed Elhoseiny
Philip H. S. Torr
23
11
0
21 Jul 2023
Divide & Bind Your Attention for Improved Generative Semantic Nursing
Yumeng Li
M. Keuper
Dan Zhang
Anna Khoreva
DiffM
25
47
0
20 Jul 2023
Reference-based Painterly Inpainting via Diffusion: Crossing the Wild Reference Domain Gap
Dejia Xu
Xingqian Xu
Wenyan Cong
Humphrey Shi
Zhangyang Wang
DiffM
29
4
0
20 Jul 2023
Improving Multimodal Datasets with Image Captioning
Thao Nguyen
S. Gadre
Gabriel Ilharco
Sewoong Oh
Ludwig Schmidt
VLM
19
70
0
19 Jul 2023
Embedded Heterogeneous Attention Transformer for Cross-lingual Image Captioning
Zijie Song
Zhenzhen Hu
Yuanen Zhou
Ye Zhao
Richang Hong
Meng Wang
21
2
0
19 Jul 2023
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
Chaoyang Zhu
Long Chen
ObjD
VLM
31
32
0
18 Jul 2023
Previous
1
2
3
...
9
10
11
...
26
27
28
Next