1504.00325
Microsoft COCO Captions: Data Collection and Evaluation Server
1 April 2015
Xinlei Chen
Hao Fang
Tsung-Yi Lin
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick
Papers citing
"Microsoft COCO Captions: Data Collection and Evaluation Server"
50 / 1,519 papers shown
BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
European Conference on Computer Vision (ECCV), 2023
Rizhao Cai
Zirui Song
Dayan Guan
Zhenhao Chen
Xing Luo
Chenyu Yi
Alex C. Kot
MLLM
VLM
319
44
0
05 Dec 2023
Object Recognition as Next Token Prediction
Computer Vision and Pattern Recognition (CVPR), 2023
Kaiyu Yue
Borchun Chen
Jonas Geiping
Hengduo Li
Tom Goldstein
Ser-Nam Lim
507
12
0
04 Dec 2023
A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Keito Kudo
Haruki Nagasawa
Jun Suzuki
Nobuyuki Shimizu
249
5
0
04 Dec 2023
Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models
Andrés Villa
Juan Carlos León Alcázar
Alvaro Soto
Bernard Ghanem
MLLM
VLM
292
19
0
03 Dec 2023
Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We?
Weisong Sun
Chunrong Fang
Yun Miao
Yudu You
Mengzhe Yuan
...
Quanjun Zhang
An Guo
Xiang Chen
Yang Liu
Zhenyu Chen
289
16
0
01 Dec 2023
InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation
Rongyao Fang
Shilin Yan
Zhaoyang Huang
Jingqiu Zhou
Hao Tian
Jifeng Dai
Jiaming Song
MLLM
213
16
0
30 Nov 2023
TLDR: Text Based Last-layer Retraining for Debiasing Image Classifiers
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Juhyeon Park
Seokhyeon Jeong
Taesup Moon
273
2
0
30 Nov 2023
Understanding and Improving In-Context Learning on Vision-language Models
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Shuo Chen
Zhen Han
Bailan He
Mark Buckley
Juil Sock
Volker Tresp
Jindong Gu
203
2
0
29 Nov 2023
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
European Conference on Computer Vision (ECCV), 2023
Shicheng Li
Lei Li
Shuhuai Ren
Yuanxin Liu
Yi Liu
Rundong Gao
Xu Sun
Lu Hou
227
49
0
29 Nov 2023
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Computer Vision and Pattern Recognition (CVPR), 2023
Pavan Kumar Anasosalu Vasu
Hadi Pouransari
Fartash Faghri
Raviteja Vemulapalli
Oncel Tuzel
CLIP
VLM
692
84
0
28 Nov 2023
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
European Conference on Computer Vision (ECCV), 2023
Yanwei Li
Chengyao Wang
Jiaya Jia
VLM
MLLM
333
480
0
28 Nov 2023
Large Language Models Meet Computer Vision: A Brief Survey
Raby Hamadi
LM&MA
150
5
0
28 Nov 2023
IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers
European Conference on Computer Vision (ECCV), 2023
Chenglin Yang
Siyuan Qiao
Yuan Cao
Yu Zhang
Tao Zhu
Yaoyao Liu
Jiahui Yu
VLM
163
3
0
27 Nov 2023
Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models
Munan Ning
Bin Zhu
Yujia Xie
Bin Lin
Jiaxi Cui
Lu Yuan
Dongdong Chen
Li-ming Yuan
ELM
MLLM
213
91
0
27 Nov 2023
Fully Authentic Visual Question Answering Dataset from Online Communities
European Conference on Computer Vision (ECCV), 2023
Chongyan Chen
Xiyang Dai
Noel Codella
Yunsheng Li
Lu Yuan
Danna Gurari
373
9
0
27 Nov 2023
Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs
Yunxin Li
Zhenyu Liu
Wei Wang
Xiaochun Cao
Yuxin Ding
Min Zhang
181
6
0
27 Nov 2023
Large Language Models as Automated Aligners for benchmarking Vision-Language Models
Yuanfeng Ji
Chongjian Ge
Weikai Kong
Enze Xie
Zhengying Liu
Zhengguo Li
Ping Luo
MLLM
ELM
209
10
0
24 Nov 2023
Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models
European Conference on Computer Vision (ECCV), 2023
Yufei Zhan
Yousong Zhu
Zhiyang Chen
Fan Yang
Ming Tang
Jinqiao Wang
ObjD
242
30
0
24 Nov 2023
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
European Conference on Computer Vision (ECCV), 2023
Lin Chen
Jinsong Li
Xiao-wen Dong
Pan Zhang
Conghui He
Yuan Liu
Feng Zhao
Dahua Lin
MLLM
VLM
380
936
0
21 Nov 2023
Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching
European Conference on Computer Vision (ECCV), 2023
Meng Chu
Zhedong Zheng
Wei Ji
Tingyu Wang
Tat-Seng Chua
276
25
0
21 Nov 2023
LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
Gongwei Chen
Leyang Shen
Rui Shao
Xiang Deng
Liqiang Nie
VLM
MLLM
302
83
0
20 Nov 2023
Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention
Zuyao Chen
Jinlin Wu
Zhen Lei
Zhaoxiang Zhang
Changwen Chen
302
29
0
18 Nov 2023
Emu Edit: Precise Image Editing via Recognition and Generation Tasks
Shelly Sheynin
Adam Polyak
Uriel Singer
Yuval Kirstain
Amit Zohar
Oron Ashual
Devi Parikh
Yaniv Taigman
220
238
0
16 Nov 2023
Towards Open-Ended Visual Recognition with Large Language Model
Qihang Yu
Xiaohui Shen
Liang-Chieh Chen
VLM
246
8
0
14 Nov 2023
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2023
Peng Jin
Ryuichi Takanobu
Caiwan Zhang
Xiaochun Cao
Li-ming Yuan
MLLM
512
353
0
14 Nov 2023
Detecting and Correcting Hate Speech in Multimodal Memes with Large Visual Language Model
Minh-Hao Van
Xintao Wu
VLM
MLLM
209
15
0
12 Nov 2023
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Computer Vision and Pattern Recognition (CVPR), 2023
Bin Xiao
Haiping Wu
Weijian Xu
Xiyang Dai
Houdong Hu
Yumao Lu
Michael Zeng
Ce Liu
Lu Yuan
VLM
398
393
0
10 Nov 2023
Training CLIP models on Data from Scientific Papers
Calvin Metzger
VLM
CLIP
122
3
0
08 Nov 2023
TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models
Zhen Yang
Yingxue Zhang
Fandong Meng
Jie Zhou
VLM
MLLM
216
4
0
08 Nov 2023
OtterHD: A High-Resolution Multi-modality Model
Yue Liu
Peiyuan Zhang
Jingkang Yang
Yuanhan Zhang
Fanyi Pu
Ziwei Liu
VLM
MLLM
190
77
0
07 Nov 2023
MetaReVision: Meta-Learning with Retrieval for Visually Grounded Compositional Concept Acquisition
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Guangyue Xu
Parisa Kordjamshidi
Joyce Chai
162
2
0
02 Nov 2023
De-Diffusion Makes Text a Strong Cross-Modal Interface
Computer Vision and Pattern Recognition (CVPR), 2023
Chen Wei
Chenxi Liu
Siyuan Qiao
Zhishuai Zhang
Alan Yuille
Jiahui Yu
VLM
DiffM
274
17
0
01 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Information Fusion (Inf. Fusion), 2023
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
402
72
0
01 Nov 2023
CapsFusion: Rethinking Image-Text Data at Scale
Computer Vision and Pattern Recognition (CVPR), 2023
Qiying Yu
Quan-Sen Sun
Xiaosong Zhang
Yufeng Cui
Fan Zhang
Yue Cao
Xinlong Wang
Jingjing Liu
VLM
370
88
0
31 Oct 2023
Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Deepanway Ghosal
Navonil Majumder
Roy Ka-wei Lee
Amélie Reymond
Soujanya Poria
156
16
0
31 Oct 2023
Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li
Kunchang Li
Yinan He
Yi Wang
Yali Wang
Limin Wang
Yu Qiao
Ping Luo
CLIP
VLM
VGen
350
3
0
30 Oct 2023
Impressions: Understanding Visual Semiotics and Aesthetic Impact
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Julia Kruk
Caleb Ziems
Diyi Yang
157
3
0
27 Oct 2023
CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection
Neural Information Processing Systems (NeurIPS), 2023
Chuofan Ma
Yi Jiang
Xin Wen
Zehuan Yuan
Xiaojuan Qi
ObjD
VLM
260
70
0
25 Oct 2023
Knowledge Editing for Large Language Models: A Survey
ACM Computing Surveys (ACM Comput. Surv.), 2023
Song Wang
Yaochen Zhu
Haochen Liu
Zaiyi Zheng
Chen Chen
Wenlin Yao
KELM
455
202
0
24 Oct 2023
Leveraging Image-Text Similarity and Caption Modification for the DataComp Challenge: Filtering Track and BYOD Track
Shuhei Yokoo
Peifei Zhu
Yuchi Ishikawa
Mikihiro Tanaka
Masayoshi Kondo
Hirokatsu Kataoka
87
1
0
23 Oct 2023
OV-VG: A Benchmark for Open-Vocabulary Visual Grounding
Chunlei Wang
Wenquan Feng
Xiangtai Li
Guangliang Cheng
Shuchang Lyu
Binghao Liu
Lijiang Chen
Qi Zhao
ObjD
VLM
269
14
0
22 Oct 2023
ITEm: Unsupervised Image-Text Embedding Learning for eCommerce
Baohao Liao
Michael Kozielski
Sanjika Hewavitharana
Jiangbo Yuan
Shahram Khadivi
Tomer Lancewicki
SSL
132
0
0
22 Oct 2023
On the Transferability of Visually Grounded PCFGs
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yanpeng Zhao
Ivan Titov
144
1
0
21 Oct 2023
On the Language Encoder of Contrastive Cross-modal Models
Mengjie Zhao
Junya Ono
Zhi-Wei Zhong
Chieh-Hsin Lai
Yuhta Takida
Naoki Murata
Wei-Hsiang Liao
Takashi Shibuya
Hiromi Wakaki
Yuki Mitsufuji
VLM
145
2
0
20 Oct 2023
PrivImage: Differentially Private Synthetic Image Generation using Diffusion Models with Semantic-Aware Pretraining
Kecen Li
Chen Gong
Zhixiang Li
Yuzhong Zhao
Xinwen Hou
Tianhao Wang
358
21
0
19 Oct 2023
InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions
Hanbo Zhang
Jie Xu
Yuchen Mo
Tao Kong
192
2
0
18 Oct 2023
LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation
Computer Vision and Pattern Recognition (CVPR), 2023
Kibum Kim
Kanghoon Yoon
Jaeyeong Jeon
Yeonjun In
Jinyoung Moon
Donghyun Kim
Chanyoung Park
537
30
0
16 Oct 2023
Bounding and Filling: A Fast and Flexible Framework for Image Captioning
Zheng Ma
Changxin Wang
Bo Huang
Zi-Yue Zhu
Jianbing Zhang
187
3
0
15 Oct 2023
Leveraging Image Augmentation for Object Manipulation: Towards Interpretable Controllability in Object-Centric Learning
Jinwoo Kim
Janghyuk Choi
Jaehyun Kang
Changyeon Lee
Ho-Jin Choi
Seon Joo Kim
OCL
401
1
0
13 Oct 2023
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models
Dongsheng Jiang
Yuchen Liu
Songlin Liu
Jin'e Zhao
Hao Zhang
Zhen Gao
Xiaopeng Zhang
Jin Li
Hongkai Xiong
MLLM
VLM
411
70
0
13 Oct 2023