1504.00325
Microsoft COCO Captions: Data Collection and Evaluation Server
1 April 2015
Xinlei Chen
Hao Fang
Nayeon Lee
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick
Papers citing
"Microsoft COCO Captions: Data Collection and Evaluation Server"
50 / 1,387 papers shown
Title
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
Jiannan Wu
Yi-Xin Jiang
Bin Yan
Huchuan Lu
Zehuan Yuan
Ping Luo
VOS
37
17
0
25 Dec 2023
Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training
Xinyan Chen
Jiaxin Ge
Tianjun Zhang
Jiaming Liu
Shanghang Zhang
VLM
EGVM
34
0
0
23 Dec 2023
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
176
924
0
21 Dec 2023
Generative Multimodal Models are In-Context Learners
Quan-Sen Sun
Yufeng Cui
Xiaosong Zhang
Fan Zhang
Qiying Yu
...
Yueze Wang
Yongming Rao
Jingjing Liu
Tiejun Huang
Xinlong Wang
MLLM
LRM
45
246
0
20 Dec 2023
Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pretraining
Bumsoo Kim
Yeonsik Jo
Jinhyung Kim
S. Kim
VLM
16
7
0
19 Dec 2023
CLIM: Contrastive Language-Image Mosaic for Region Representation
Size Wu
Wenwei Zhang
Lumin Xu
Sheng Jin
Wentao Liu
Chen Change Loy
ObjD
VLM
52
15
0
18 Dec 2023
M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts
Mingsheng Li
Xin Chen
C. Zhang
Sijin Chen
Hongyuan Zhu
Fukun Yin
Gang Yu
Tao Chen
28
24
0
17 Dec 2023
Simple Image-level Classification Improves Open-vocabulary Object Detection
Ru Fang
Guansong Pang
Xiaolong Bai
ObjD
VLM
50
14
0
16 Dec 2023
Tell Me What You See: Text-Guided Real-World Image Denoising
E. Yosef
Raja Giryes
DiffM
59
2
0
15 Dec 2023
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
Jinguo Zhu
Xiaohan Ding
Yixiao Ge
Yuying Ge
Sijie Zhao
Hengshuang Zhao
Xiaohua Wang
Ying Shan
ViT
VLM
13
32
0
14 Dec 2023
Pixel Aligned Language Models
Jiarui Xu
Xingyi Zhou
Shen Yan
Xiuye Gu
Anurag Arnab
Chen Sun
Xiaolong Wang
Cordelia Schmid
MLLM
VLM
45
14
0
14 Dec 2023
CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer
Yabing Wang
Fan Wang
Jianfeng Dong
Hao Luo
VLM
24
9
0
14 Dec 2023
Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models
Zhiyuan You
Zheyuan Li
Jinjin Gu
Zhenfei Yin
Tianfan Xue
Chao Dong
EGVM
21
35
0
14 Dec 2023
Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning
Zhiyue Liu
Jinyuan Liu
Fanrong Ma
CLIP
VLM
27
10
0
14 Dec 2023
Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator
Henry Hengyuan Zhao
Pan Zhou
Mike Zheng Shou
MLLM
SyDa
38
7
0
11 Dec 2023
Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods
Panos Achlioptas
Alexandros Benetatos
Iordanis Fostiropoulos
Dimitris Skourtis
21
8
0
11 Dec 2023
GlitchBench: Can large multimodal models detect video game glitches?
Mohammad Reza Taesiri
Tianjun Feng
Anh Nguyen
C. Bezemer
MLLM
VLM
LRM
30
9
0
08 Dec 2023
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects
Junyu Lu
Ruyi Gan
Di Zhang
Xiaojun Wu
Ziwei Wu
Renliang Sun
Jiaxing Zhang
Pingjian Zhang
Yan Song
MLLM
VLM
23
15
0
08 Dec 2023
GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives
Zuyao Chen
Jinlin Wu
Zhen Lei
Zhaoxiang Zhang
Changwen Chen
20
2
0
07 Dec 2023
Open-Vocabulary Segmentation with Semantic-Assisted Calibration
Yong Liu
Sule Bai
Guanbin Li
Yitong Wang
Yansong Tang
VLM
31
28
0
07 Dec 2023
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Rizhao Cai
Zirui Song
Dayan Guan
Zhenhao Chen
Xing Luo
Chenyu Yi
Alex C. Kot
MLLM
VLM
33
31
0
05 Dec 2023
Object Recognition as Next Token Prediction
Kaiyu Yue
Borchun Chen
Jonas Geiping
Hengduo Li
Tom Goldstein
Ser-Nam Lim
40
9
0
04 Dec 2023
A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video
Keito Kudo
Haruki Nagasawa
Jun Suzuki
Nobuyuki Shimizu
40
2
0
04 Dec 2023
Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models
Andrés Villa
Juan Carlos León Alcázar
Alvaro Soto
Bernard Ghanem
MLLM
VLM
24
9
0
03 Dec 2023
Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We?
Dongrui Liu
Chunrong Fang
Yun Miao
Yudu You
Mengzhe Yuan
...
Quanjun Zhang
An Guo
Xiang Chen
Yang Liu
Zhenyu Chen
35
5
0
01 Dec 2023
InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation
Rongyao Fang
Shilin Yan
Zhaoyang Huang
Jingqiu Zhou
Hao Tian
Jifeng Dai
Hongsheng Li
MLLM
45
8
0
30 Nov 2023
TLDR: Text Based Last-layer Retraining for Debiasing Image Classifiers
Juhyeon Park
Seokhyeon Jeong
Taesup Moon
35
1
0
30 Nov 2023
Understanding and Improving In-Context Learning on Vision-language Models
Shuo Chen
Zhen Han
Bailan He
Mark Buckley
Philip H. S. Torr
Volker Tresp
Jindong Gu
27
6
0
29 Nov 2023
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
Shicheng Li
Lei Li
Shuhuai Ren
Yuanxin Liu
Yi Liu
Rundong Gao
Xu Sun
Lu Hou
34
29
0
29 Nov 2023
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Pavan Kumar Anasosalu Vasu
Hadi Pouransari
Fartash Faghri
Raviteja Vemulapalli
Oncel Tuzel
CLIP
VLM
31
43
0
28 Nov 2023
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Yanwei Li
Chengyao Wang
Jiaya Jia
VLM
MLLM
38
259
0
28 Nov 2023
Large Language Models Meet Computer Vision: A Brief Survey
Raby Hamadi
LM&MA
23
4
0
28 Nov 2023
IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers
Chenglin Yang
Siyuan Qiao
Yuan Cao
Yu Zhang
Tao Zhu
Alan L. Yuille
Jiahui Yu
VLM
18
3
0
27 Nov 2023
Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models
Munan Ning
Bin Zhu
Yujia Xie
Bin Lin
Jiaxi Cui
Lu Yuan
Dongdong Chen
Li-ming Yuan
ELM
MLLM
27
58
0
27 Nov 2023
Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs
Yunxin Li
Baotian Hu
Wei Wang
Xiaochun Cao
Min Zhang
16
4
0
27 Nov 2023
Fully Authentic Visual Question Answering Dataset from Online Communities
Chongyan Chen
Mengchen Liu
Noel Codella
Yunsheng Li
Lu Yuan
Danna Gurari
43
5
0
27 Nov 2023
Large Language Models as Automated Aligners for benchmarking Vision-Language Models
Yuanfeng Ji
Chongjian Ge
Weikai Kong
Enze Xie
Zhengying Liu
Zhengguo Li
Ping Luo
MLLM
ELM
34
7
0
24 Nov 2023
Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models
Yufei Zhan
Yousong Zhu
Zhiyang Chen
Fan Yang
E. Goles
Jinqiao Wang
ObjD
52
14
0
24 Nov 2023
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
Lin Chen
Jinsong Li
Xiao-wen Dong
Pan Zhang
Conghui He
Jiaqi Wang
Feng Zhao
Dahua Lin
MLLM
VLM
58
582
0
21 Nov 2023
Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching
Meng Chu
Zhedong Zheng
Wei Ji
Tingyu Wang
Tat-Seng Chua
21
9
0
21 Nov 2023
LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
Gongwei Chen
Leyang Shen
Rui Shao
Xiang Deng
Liqiang Nie
VLM
MLLM
67
42
0
20 Nov 2023
Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
Zuyao Chen
Jinlin Wu
Zhen Lei
Zhaoxiang Zhang
Changwen Chen
25
11
0
18 Nov 2023
Emu Edit: Precise Image Editing via Recognition and Generation Tasks
Shelly Sheynin
Adam Polyak
Uriel Singer
Yuval Kirstain
Amit Zohar
Oron Ashual
Devi Parikh
Yaniv Taigman
19
129
0
16 Nov 2023
Towards Open-Ended Visual Recognition with Large Language Model
Qihang Yu
Xiaohui Shen
Liang-Chieh Chen
VLM
22
8
0
14 Nov 2023
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Peng Jin
Ryuichi Takanobu
Caiwan Zhang
Xiaochun Cao
Li-ming Yuan
MLLM
36
223
0
14 Nov 2023
Detecting and Correcting Hate Speech in Multimodal Memes with Large Visual Language Model
Minh-Hao Van
Xintao Wu
VLM
MLLM
30
10
0
12 Nov 2023
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Bin Xiao
Haiping Wu
Weijian Xu
Xiyang Dai
Houdong Hu
Yumao Lu
Michael Zeng
Ce Liu
Lu Yuan
VLM
36
143
0
10 Nov 2023
Training CLIP models on Data from Scientific Papers
Calvin Metzger
VLM
CLIP
19
1
0
08 Nov 2023
TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models
Zhen Yang
Yingxue Zhang
Fandong Meng
Jie Zhou
VLM
MLLM
42
3
0
08 Nov 2023
OtterHD: A High-Resolution Multi-modality Model
Bo-wen Li
Peiyuan Zhang
Jingkang Yang
Yuanhan Zhang
Fanyi Pu
Ziwei Liu
VLM
MLLM
35
65
0
07 Nov 2023