ResearchTrend.AI
arXiv:1504.00325 · Cited By
Microsoft COCO Captions: Data Collection and Evaluation Server

1 April 2015
Xinlei Chen
Hao Fang
Nayeon Lee
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick

Papers citing "Microsoft COCO Captions: Data Collection and Evaluation Server"

50 / 1,519 papers shown
Think Before You Act: A Two-Stage Framework for Mitigating Gender Bias Towards Vision-Language Tasks
Yunqi Zhang
Songda Li
Chunyuan Deng
Luyi Wang
Hui Zhao
336
0
0
27 May 2024
Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models
C. N. Vasconcelos
Abdullah Rashwan
Austin Waters
Trevor Walker
Keyang Xu
Jimmy Yan
...
Wenlei Zhou
Kevin Swersky
David J. Fleet
Jason Baldridge
Oliver Wang
208
4
0
27 May 2024
Multilingual Diversity Improves Vision-Language Representations
Thao Nguyen
Matthew Wallingford
Sebastin Santy
Wei-Chiu Ma
Sewoong Oh
Ludwig Schmidt
Pang Wei Koh
Ranjay Krishna
VLM
352
15
0
27 May 2024
A Survey of Multimodal Large Language Model from A Data-centric Perspective
Tianyi Bai
Hao Liang
Binwang Wan
Yanran Xu
Xi Li
...
Ping Huang
Jiulong Shan
Conghui He
Binhang Yuan
Wentao Zhang
393
64
0
26 May 2024
OmniBind: Teach to Build Unequal-Scale Modality Interaction for Omni-Bind of All
Yuanhuiyi Lyu
Xueye Zheng
Dahun Kim
Lin Wang
271
20
0
25 May 2024
Disease-informed Adaptation of Vision-Language Models
Jiajin Zhang
Ge Wang
Mannudeep K. Kalra
Pingkun Yan
VLM
275
7
0
24 May 2024
Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models
Yimeng Zhang
Xin Chen
Jinghan Jia
Yihua Zhang
Chongyu Fan
Jiancheng Liu
Mingyi Hong
Ke Ding
Sijia Liu
DiffM
468
110
0
24 May 2024
DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception
Run Luo
Yunshui Li
Longze Chen
Wanwei He
Ting-En Lin
...
Zikai Song
Xiaobo Xia
Tongliang Liu
Min Yang
Binyuan Hui
VLM, DiffM
453
34
0
24 May 2024
PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference
Jiarui Fang
Jinzhe Pan
Aoyu Li
PengCheng Yang
Jiannan Wang
AI4CE
542
8
0
23 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
910
169
0
23 May 2024
Safety Alignment for Vision Language Models
Zhendong Liu
Yuanbi Nie
Yingshui Tan
Xiangyu Yue
Qiushi Cui
Chongjun Wang
Xiaoyong Zhu
Bo Zheng
VLM, MLLM
272
17
0
22 May 2024
Efficient Multimodal Large Language Models: A Survey
Yizhang Jin
Jian Li
Yexin Liu
Tianjun Gu
Kai Wu
...
Xin Tan
Zhenye Gan
Yabiao Wang
Chengjie Wang
Lizhuang Ma
LRM
307
86
0
17 May 2024
Libra: Building Decoupled Vision System on Large Language Models. International Conference on Machine Learning (ICML), 2024
Yifan Xu
Xiaoshan Yang
Y. Song
Changsheng Xu
MLLM, VLM
202
10
0
16 May 2024
CLIP with Quality Captions: A Strong Pretraining for Vision Tasks
Pavan Kumar Anasosalu Vasu
Hadi Pouransari
Fartash Faghri
Oncel Tuzel
VLM, CLIP
266
9
0
14 May 2024
Open-Vocabulary Object Detection via Neighboring Region Attention Alignment. Engineering Applications of Artificial Intelligence (EAAI), 2024
Sunyuan Qiang
Xianfei Li
Yanyan Liang
Wenlong Liao
Tao He
Pai Peng
ObjD
229
0
0
14 May 2024
Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning. International Conference on Machine Learning (ICML), 2024
Shibo Jie
Yehui Tang
Ning Ding
Zhi-Hong Deng
Kai Han
Yunhe Wang
VLM
345
20
0
09 May 2024
Universal Adversarial Perturbations for Vision-Language Pre-trained Models. Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2024
Pengfei Zhang
Zi Huang
Guangdong Bai
AAML
195
25
0
09 May 2024
MANTIS: Interleaved Multi-Image Instruction Tuning
Dongfu Jiang
Xuan He
Huaye Zeng
Cong Wei
Max Ku
Qian Liu
Wenhu Chen
VLM, MLLM
420
183
0
02 May 2024
FITA: Fine-grained Image-Text Aligner for Radiology Report Generation
Honglong Yang
Hui Tang
Xiaomeng Li
MedIm
210
3
0
02 May 2024
DOCCI: Descriptions of Connected and Contrasting Images
Yasumasa Onoe
Sunayana Rane
Zachary Berger
Yonatan Bitton
Jaemin Cho
...
Zarana Parekh
Jordi Pont-Tuset
Garrett Tanzer
Su Wang
Jason Baldridge
275
97
0
30 Apr 2024
Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models
Yuhang Huang
Zihan Wu
Chongyang Gao
Jiawei Peng
Xu Yang
213
2
0
26 Apr 2024
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen
Weiyun Wang
Hao Tian
Shenglong Ye
Zhangwei Gao
...
Tong Lu
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
MLLM, VLM
530
994
0
25 Apr 2024
DesignProbe: A Graphic Design Benchmark for Multimodal Large Language Models
Jieru Lin
Danqing Huang
Tiejun Zhao
Dechen Zhan
Chin-Yew Lin
VLM, MLLM
228
4
0
23 Apr 2024
EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning
Mingjie Ma
Zhihuan Yu
Yichao Ma
Guohui Li
LRM
201
2
0
22 Apr 2024
The Solution for the CVPR2024 NICE Image Captioning Challenge
Longfei Huang
Shupeng Zhong
Xiangyu Wu
Ruoxuan Li
216
0
0
19 Apr 2024
Towards Multi-modal Transformers in Federated Learning
Guangyu Sun
Matías Mendieta
Aritra Dutta
Xin Li
Chong Chen
286
14
0
18 Apr 2024
ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis
Aashish Anantha Ramakrishnan
Sharon X. Huang
Dongwon Lee
225
0
0
15 Apr 2024
UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark
Zhaokun Zhou
Qiulin Wang
Bin Lin
Yiwei Su
Ruoxin Chen
Xin Tao
Amin Zheng
Li-xin Yuan
Pengfei Wan
Di Zhang
145
30
0
15 Apr 2024
COCONut: Modernizing COCO Segmentation
XueQing Deng
Qihang Yu
Peng Wang
Xiaohui Shen
Liang-Chieh Chen
206
22
0
12 Apr 2024
Training-free Boost for Open-Vocabulary Object Detection with Confidence Aggregation
Yanhao Zheng
Kai Liu
ObjD
206
3
0
12 Apr 2024
Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies
Zichao Li
Cihang Xie
E. D. Cubuk
CLIP
220
11
0
12 Apr 2024
Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models. International Conference on Learning Representations (ICLR), 2024
Simon Schrodi
David T. Hoffmann
Max Argus
Volker Fischer
Thomas Brox
VLM
518
4
0
11 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models. European Conference on Computer Vision (ECCV), 2024
Ouguzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM, VLM
308
58
0
10 Apr 2024
Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation. Computer Vision and Pattern Recognition (CVPR), 2024
Luca Barsellotti
Roberto Amoroso
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
VLM, DiffM
221
28
0
09 Apr 2024
Improving Interpretable Embeddings for Ad-hoc Video Search with Generative Captions and Multi-word Concept Bank
Jiaxin Wu
Chong-Wah Ngo
W. Chan
VGen
190
3
0
09 Apr 2024
MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning
Matteo Farina
Goran Frehse
Elia Cunegatti
Gaowen Liu
Giovanni Iacca
Elisa Ricci
VLM
278
8
0
08 Apr 2024
ByteEdit: Boost, Comply and Accelerate Generative Image Editing
Yuxi Ren
Jie Wu
Yanzuo Lu
Huafeng Kuang
Xin Xia
...
Shiyin Wang
Xuefeng Xiao
Yitong Wang
Min Zheng
Xing Mei
175
10
0
07 Apr 2024
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching. Neural Information Processing Systems (NeurIPS), 2024
Dongzhi Jiang
Guanglu Song
Xiaoshi Wu
Renrui Zhang
Dazhong Shen
Zhuofan Zong
Yu Liu
Jiaming Song
VLM
459
54
0
04 Apr 2024
Unblind Text Inputs: Predicting Hint-text of Text Input in Mobile Apps via LLM. International Conference on Human Factors in Computing Systems (CHI), 2024
Zhe Liu
Chunyang Chen
Peng Li
Mengzhuo Chen
Boyu Wu
Yuekai Huang
Jun Hu
Qing Wang
175
22
0
03 Apr 2024
Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling. IEEE Transactions on Multimedia (IEEE TMM), 2024
Xu Wang
Yifan Li
Qiudan Zhang
Wen-Bin Wu
Mark Junjie Li
Jianmin Jinag
252
5
0
03 Apr 2024
ViTamin: Designing Scalable Vision Models in the Vision-Language Era. Computer Vision and Pattern Recognition (CVPR), 2024
Jienneg Chen
Qihang Yu
Xiaohui Shen
Yaoyao Liu
Liang-Chieh Chen
3DV, VLM
415
50
0
02 Apr 2024
mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning
Jingxuan Wei
Nan Xu
Guiyong Chang
Yin Luo
Bihui Yu
Ruifeng Guo
212
8
0
02 Apr 2024
VideoDistill: Language-aware Vision Distillation for Video Question Answering
Bo Zou
Chao Yang
Yu Qiao
Chengbin Quan
Youjian Zhao
VGen
242
3
0
01 Apr 2024
LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction
Bo Zou
Chao Yang
Yu Qiao
Chengbin Quan
Youjian Zhao
236
8
0
01 Apr 2024
Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
Rongjie Li
Yu Wu
Xuming He
MLLM, LRM, VLM
211
3
0
01 Apr 2024
From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models
Rongjie Li
Songyang Zhang
Dahua Lin
Kai-xiang Chen
Xuming He
VLM
339
43
0
01 Apr 2024
DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model
Lirui Zhao
Yue Yang
Kaipeng Zhang
Wenqi Shao
Yuxin Zhang
Yu Qiao
Ping Luo
Rongrong Ji
LM&Ro, LLMAG, VLM
134
7
0
31 Mar 2024
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
Kai Zhang
Yi Luan
Hexiang Hu
Kenton Lee
Siyuan Qiao
Wenhu Chen
Yu-Chuan Su
Ming-Wei Chang
VLM, LRM
307
77
0
28 Mar 2024
LocCa: Visual Pretraining with Location-aware Captioners
Bo Wan
Michael Tschannen
Yongqin Xian
Filip Pavetić
Ibrahim Alabdulmohsin
Xiao Wang
André Susano Pinto
Andreas Steiner
Lucas Beyer
Xiao-Qi Zhai
VLM
379
22
0
28 Mar 2024
ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds
Gijs Wijngaard
Elia Formisano
Bruno L. Giordano
M. Dumontier
214
5
0
27 Mar 2024
Page 8 of 31