ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1504.00325
  4. Cited By
Microsoft COCO Captions: Data Collection and Evaluation Server
v1v2 (latest)

Microsoft COCO Captions: Data Collection and Evaluation Server

1 April 2015
Xinlei Chen
Hao Fang
Nayeon Lee
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick
ArXiv (abs)PDFHTML

Papers citing "Microsoft COCO Captions: Data Collection and Evaluation Server"

50 / 1,515 papers shown
Title
Lightweight In-Context Tuning for Multimodal Unified Models
Lightweight In-Context Tuning for Multimodal Unified Models
Yixin Chen
Shuai Zhang
Boran Han
Jiaya Jia
116
5
0
08 Oct 2023
VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via
  Pre-trained Models
VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained ModelsNeural Information Processing Systems (NeurIPS), 2023
Ziyi Yin
Muchao Ye
Tianrong Zhang
Tianyu Du
Jinguo Zhu
Han Liu
Jinghui Chen
Ting Wang
Fenglong Ma
AAMLVLMCoGe
306
62
0
07 Oct 2023
Module-wise Adaptive Distillation for Multimodality Foundation Models
Module-wise Adaptive Distillation for Multimodality Foundation ModelsNeural Information Processing Systems (NeurIPS), 2023
Chen Liang
Jiahui Yu
Ming-Hsuan Yang
Matthew A. Brown
Huayu Chen
Tuo Zhao
Boqing Gong
Tianyi Zhou
158
12
0
06 Oct 2023
Envisioning Narrative Intelligence: A Creative Visual Storytelling
  Anthology
Envisioning Narrative Intelligence: A Creative Visual Storytelling AnthologyInternational Conference on Human Factors in Computing Systems (CHI), 2023
Brett A. Halperin
S. Lukin
CoGe
165
29
0
06 Oct 2023
ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language
  Models
ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language ModelsInternational Conference on Learning Representations (ICLR), 2023
Yi-Lin Sung
Jaehong Yoon
Mohit Bansal
VLM
229
19
0
04 Oct 2023
ReForm-Eval: Evaluating Large Vision Language Models via Unified
  Re-Formulation of Task-Oriented Benchmarks
ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented BenchmarksACM Multimedia (ACM MM), 2023
Zejun Li
Ye Wang
Mengfei Du
Qingwen Liu
Binhao Wu
...
Zhihao Fan
Jie Fu
Jingjing Chen
Xuanjing Huang
Zhongyu Wei
226
15
0
04 Oct 2023
Constructing Image-Text Pair Dataset from Books
Constructing Image-Text Pair Dataset from Books
Yamato Okamoto
Haruto Toyonaga
Yoshihisa Ijiri
Hirokatsu Kataoka
156
4
0
03 Oct 2023
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense
  Prediction
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense PredictionInternational Conference on Learning Representations (ICLR), 2023
Size Wu
Wenwei Zhang
Lumin Xu
Sheng Jin
Xiangtai Li
Wentao Liu
Chen Change Loy
CLIPVLM
205
98
0
02 Oct 2023
Towards reporting bias in visual-language datasets: bimodal augmentation
  by decoupling object-attribute association
Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association
Qiyu Wu
Mengjie Zhao
Yutong He
Lang Huang
Junya Ono
Hiromi Wakaki
Yuki Mitsufuji
252
5
0
02 Oct 2023
Making LLaMA SEE and Draw with SEED Tokenizer
Making LLaMA SEE and Draw with SEED TokenizerInternational Conference on Learning Representations (ICLR), 2023
Yuying Ge
Sijie Zhao
Ziyun Zeng
Yixiao Ge
Chen Li
Xintao Wang
Ying Shan
141
174
0
02 Oct 2023
Understanding Transferable Representation Learning and Zero-shot
  Transfer in CLIP
Understanding Transferable Representation Learning and Zero-shot Transfer in CLIPInternational Conference on Learning Representations (ICLR), 2023
Zixiang Chen
Yihe Deng
Yuanzhi Li
Quanquan Gu
VLM
301
16
0
02 Oct 2023
Region-centric Image-Language Pretraining for Open-Vocabulary Detection
Region-centric Image-Language Pretraining for Open-Vocabulary DetectionEuropean Conference on Computer Vision (ECCV), 2023
Dahun Kim
A. Angelova
Weicheng Kuo
ObjDVLM
198
6
0
29 Sep 2023
Directly Fine-Tuning Diffusion Models on Differentiable Rewards
Directly Fine-Tuning Diffusion Models on Differentiable RewardsInternational Conference on Learning Representations (ICLR), 2023
Amita Gajewar
Paul Vicol
G. Bansal
David J Fleet
219
275
0
29 Sep 2023
YOLOR-Based Multi-Task Learning
YOLOR-Based Multi-Task Learning
Hung-Shuo Chang
Chien-Yao Wang
Hang Yan
Yukun Zhu
Hongpeng Liao
MoEVLM
157
20
0
29 Sep 2023
Self-supervised Cross-view Representation Reconstruction for Change
  Captioning
Self-supervised Cross-view Representation Reconstruction for Change CaptioningIEEE International Conference on Computer Vision (ICCV), 2023
Yunbin Tu
Liang Li
Filippos Christianos
Zheng-Jun Zha
Zhibin Li
Qingming Huang
SSL
161
36
0
28 Sep 2023
InternLM-XComposer: A Vision-Language Large Model for Advanced
  Text-image Comprehension and Composition
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
Pan Zhang
Xiaoyi Wang
Bin Wang
Yuhang Cao
Chao Xu
...
Conghui He
Xingcheng Zhang
Yu Qiao
Da Lin
Yuan Liu
MLLM
610
299
0
26 Sep 2023
CWCL: Cross-Modal Transfer with Continuously Weighted Contrastive Loss
CWCL: Cross-Modal Transfer with Continuously Weighted Contrastive LossNeural Information Processing Systems (NeurIPS), 2023
R. S. Srinivasa
Jaejin Cho
Chouchang Yang
Yashas Malur Saidutta
Ching Hua Lee
Yilin Shen
Hongxia Jin
VLM
168
15
0
26 Sep 2023
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level
  Vision
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level VisionInternational Conference on Learning Representations (ICLR), 2023
Haoning Wu
Zicheng Zhang
Erli Zhang
Chaofeng Chen
Liang Liao
...
Chunyi Li
Wenxiu Sun
Qiong Yan
Guangtao Zhai
Weisi Lin
VLM
296
215
0
25 Sep 2023
Semi-Supervised Domain Generalization for Object Detection via
  Language-Guided Feature Alignment
Semi-Supervised Domain Generalization for Object Detection via Language-Guided Feature AlignmentBritish Machine Vision Conference (BMVC), 2023
Sina Malakouti
Adriana Kovashka
ObjD
149
2
0
24 Sep 2023
A Survey on Image-text Multimodal Models
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai-Nguyen Nguyen
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
252
20
0
23 Sep 2023
Detect Everything with Few Examples
Detect Everything with Few ExamplesConference on Robot Learning (CoRL), 2023
Xinyu Zhang
Yuting Wang
Abdeslam Boularias
ObjDVLM
269
22
0
22 Sep 2023
Weakly-supervised Automated Audio Captioning via text only training
Weakly-supervised Automated Audio Captioning via text only training
Theodoros Kouzelis
Vassilis Katsouros
CLIP
168
10
0
21 Sep 2023
Viewpoint Integration and Registration with Vision Language Foundation
  Model for Image Change Understanding
Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding
Xiaonan Lu
Jianlong Yuan
Ruigang Niu
Yuan Hu
Fan Wang
112
3
0
15 Sep 2023
Looking at words and points with attention: a benchmark for
  text-to-shape coherence
Looking at words and points with attention: a benchmark for text-to-shape coherence
Andrea Amaduzzi
Giuseppe Lisanti
Samuele Salti
Luigi Di Stefano
116
3
0
14 Sep 2023
TextBind: Multi-turn Interleaved Multimodal Instruction-following in the
  Wild
TextBind: Multi-turn Interleaved Multimodal Instruction-following in the WildAnnual Meeting of the Association for Computational Linguistics (ACL), 2023
Huayang Li
Siheng Li
Deng Cai
Longyue Wang
Lemao Liu
Taro Watanabe
Yujiu Yang
Shuming Shi
MLLM
253
22
0
14 Sep 2023
SwitchGPT: Adapting Large Language Models for Non-Text Outputs
SwitchGPT: Adapting Large Language Models for Non-Text Outputs
Xinyu Wang
Bohan Zhuang
Qi Wu
MLLM
124
4
0
14 Sep 2023
ITI-GEN: Inclusive Text-to-Image Generation
ITI-GEN: Inclusive Text-to-Image GenerationIEEE International Conference on Computer Vision (ICCV), 2023
Cheng Zhang
Xuanbai Chen
Siqi Chai
Chen Henry Wu
Dmitry Lagun
Thabo Beeler
Fernando de la Torre
VLM
207
75
0
11 Sep 2023
Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal
  Retrieval
Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal RetrievalIEEE Transactions on Image Processing (IEEE TIP), 2023
Yabing Wang
Shuhui Wang
Hao Luo
Jianfeng Dong
F. Wang
Meng Han
Xun Wang
Meng Wang
170
13
0
11 Sep 2023
DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning
DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuningInternational Conference on Learning Representations (ICLR), 2023
Zhengxiang Shi
Aldo Lipani
VLM
403
41
0
11 Sep 2023
ImageBind-LLM: Multi-modality Instruction Tuning
ImageBind-LLM: Multi-modality Instruction Tuning
Jiaming Han
Renrui Zhang
Wenqi Shao
Shiyang Feng
Peng Xu
...
Yafei Wen
Xiaoxin Chen
Xiangyu Yue
Jiaming Song
Yu Qiao
MLLM
232
148
0
07 Sep 2023
DetermiNet: A Large-Scale Diagnostic Dataset for Complex
  Visually-Grounded Referencing using Determiners
DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using DeterminersIEEE International Conference on Computer Vision (ICCV), 2023
Clarence Lee
M Ganesh Kumar
Cheston Tan
139
3
0
07 Sep 2023
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction
  Tuning
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
L. Yu
Bowen Shi
Ramakanth Pasunuru
Benjamin Muller
O. Yu. Golovneva
...
Yaniv Taigman
Maryam Fazel-Zarandi
Asli Celikyilmaz
Luke Zettlemoyer
Armen Aghajanyan
MLLM
191
160
0
05 Sep 2023
S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical
  Learning
S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical LearningComputer Vision and Pattern Recognition (CVPR), 2023
Wei Suo
Mengyang Sun
Weisong Liu
Yi-Meng Gao
Peifeng Wang
Yanning Zhang
Qi Wu
LRM
149
11
0
05 Sep 2023
NICE: CVPR 2023 Challenge on Zero-shot Image Captioning
NICE: CVPR 2023 Challenge on Zero-shot Image Captioning
Taehoon Kim
Pyunghwan Ahn
Sangyun Kim
Sihaeng Lee
Mark A Marsden
...
Yujin Wang
Yimu Wang
Tiancheng Gu
Xingchang Lv
Mingmao Sun
VLM
192
8
0
05 Sep 2023
Exploring Limits of Diffusion-Synthetic Training with Weakly Supervised
  Semantic Segmentation
Exploring Limits of Diffusion-Synthetic Training with Weakly Supervised Semantic SegmentationAsian Conference on Computer Vision (ACCV), 2023
Ryota Yoshihashi
Yuya Otsuka
Kenji Doi
Tomohiro Tanaka
Hirokatsu Kataoka
370
4
0
04 Sep 2023
Contrastive Feature Masking Open-Vocabulary Vision Transformer
Contrastive Feature Masking Open-Vocabulary Vision TransformerIEEE International Conference on Computer Vision (ICCV), 2023
Dahun Kim
A. Angelova
Weicheng Kuo
ObjDVLM
262
35
0
02 Sep 2023
Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D
  Understanding, Generation, and Instruction Following
Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following
Ziyu Guo
Renrui Zhang
Xiangyang Zhu
Yiwen Tang
Xianzheng Ma
...
Ke Chen
Shiyang Feng
Xianzhi Li
Jiaming Song
Pheng-Ann Heng
MLLM
312
184
0
01 Sep 2023
Large Content And Behavior Models To Understand, Simulate, And Optimize
  Content And Behavior
Large Content And Behavior Models To Understand, Simulate, And Optimize Content And BehaviorInternational Conference on Learning Representations (ICLR), 2023
Ashmit Khandelwal
Aditya Agrawal
Aanisha Bhattacharyya
Yaman Kumar Singla
Somesh Singh
...
Ishita Dasgupta
Stefano Petrangeli
R. Shah
Changyou Chen
Balaji Krishnamurthy
275
10
0
01 Sep 2023
Towards Addressing the Misalignment of Object Proposal Evaluation for
  Vision-Language Tasks via Semantic Grounding
Towards Addressing the Misalignment of Object Proposal Evaluation for Vision-Language Tasks via Semantic GroundingIEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Joshua Forster Feinglass
Yezhou Yang
132
2
0
01 Sep 2023
TouchStone: Evaluating Vision-Language Models by Language Models
TouchStone: Evaluating Vision-Language Models by Language Models
Shuai Bai
Shusheng Yang
Jinze Bai
Peng Wang
Xing Zhang
Junyang Lin
Xinggang Wang
Chang Zhou
Jingren Zhou
MLLM
226
56
0
31 Aug 2023
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object
  Detection
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object DetectionIEEE Transactions on Image Processing (IEEE TIP), 2023
Yifan Xu
Mengdan Zhang
Xiaoshan Yang
Changsheng Xu
ObjD
169
8
0
30 Aug 2023
Evaluation and Analysis of Hallucination in Large Vision-Language Models
Evaluation and Analysis of Hallucination in Large Vision-Language Models
Junyan Wang
Yi Zhou
Guohai Xu
Pengcheng Shi
Chenlin Zhao
...
Mingshi Yan
Ji Zhang
Jihua Zhu
Jitao Sang
Haoyu Tang
MLLM
212
90
0
29 Aug 2023
AI-Generated Content (AIGC) for Various Data Modalities: A Survey
AI-Generated Content (AIGC) for Various Data Modalities: A SurveyACM Computing Surveys (ACM Comput. Surv.), 2023
Lin Geng Foo
Hossein Rahmani
Jing Liu
636
42
0
27 Aug 2023
MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual
  Captioning
MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual CaptioningAnnual Meeting of the Association for Computational Linguistics (ACL), 2023
Bang-ju Yang
Fenglin Liu
X. Wu
Yaowei Wang
Xu Sun
Yuexian Zou
VLMCLIP
188
18
0
25 Aug 2023
Qwen-VL: A Versatile Vision-Language Model for Understanding,
  Localization, Text Reading, and Beyond
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai
Shuai Bai
Shusheng Yang
Shijie Wang
Sinan Tan
Peng Wang
Junyang Lin
Chang Zhou
Jingren Zhou
MLLMVLMObjD
417
1,488
0
24 Aug 2023
SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data
SCoRD: Subject-Conditional Relation Detection with Text-Augmented DataIEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Ziyan Yang
Kushal Kafle
Zhe Lin
Scott D. Cohen
Zhihong Ding
Vicente Ordonez
188
1
0
24 Aug 2023
HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt
  interaction tasks
HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks
Zichao Dong
Weikun Zhang
Xufeng Huang
Hang Ji
Xin Zhan
Junbo Chen
VLM
71
6
0
24 Aug 2023
RefEgo: Referring Expression Comprehension Dataset from First-Person
  Perception of Ego4D
RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4DIEEE International Conference on Computer Vision (ICCV), 2023
Shuhei Kurita
Naoki Katsura
Eri Onami
EgoV
182
22
0
23 Aug 2023
VQA Therapy: Exploring Answer Differences by Visually Grounding Answers
VQA Therapy: Exploring Answer Differences by Visually Grounding AnswersIEEE International Conference on Computer Vision (ICCV), 2023
Chongyan Chen
Samreen Anjum
Danna Gurari
202
15
0
21 Aug 2023
Explore and Tell: Embodied Visual Captioning in 3D Environments
Explore and Tell: Embodied Visual Captioning in 3D EnvironmentsIEEE International Conference on Computer Vision (ICCV), 2023
Anwen Hu
Shizhe Chen
Liang Zhang
Qin Jin
LM&Ro
157
3
0
21 Aug 2023
Previous
123...111213...293031
Next