Microsoft COCO Captions: Data Collection and Evaluation Server

1 April 2015
Xinlei Chen
Hao Fang
Nayeon Lee
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick

Papers citing "Microsoft COCO Captions: Data Collection and Evaluation Server"

50 / 1,515 papers shown
Generic Attention-model Explainability by Weighted Relevance Accumulation
ACM Multimedia Asia (MA), 2023
Yiming Huang
Ao Jia
Xiaodan Zhang
Jiawei Zhang
110
4
0
20 Aug 2023
VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control
IEEE International Conference on Computer Vision (ICCV), 2023
Zi-Yuan Hu
Yanyang Li
Michael R. Lyu
Liwei Wang
VLM
149
23
0
18 Aug 2023
RLIPv2: Fast Scaling of Relational Language-Image Pre-training
IEEE International Conference on Computer Vision (ICCV), 2023
Hangjie Yuan
Shiwei Zhang
Xiang Wang
Samuel Albanie
Yining Pan
Tao Feng
Jianwen Jiang
Dong Ni
Yingya Zhang
Deli Zhao
VLM
195
59
0
18 Aug 2023
Diffusion Based Augmentation for Captioning and Retrieval in Cultural Heritage
Dario Cioni
Lorenzo Berlincioni
Federico Becattini
Marco Bertini
DiffM
112
13
0
14 Aug 2023
Foundation Model is Efficient Multimodal Multitask Model Selector
Neural Information Processing Systems (NeurIPS), 2023
Fanqing Meng
Wenqi Shao
Zhanglin Peng
Chong Jiang
Kaipeng Zhang
Yu Qiao
Ping Luo
123
21
0
11 Aug 2023
Your Negative May not Be True Negative: Boosting Image-Text Matching with False Negative Elimination
ACM Multimedia (ACM MM), 2023
Haoxuan Li
Yi Bin
Junrong Liao
Yang Yang
Heng Tao Shen
161
42
0
08 Aug 2023
Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval
ACM Multimedia (ACM MM), 2023
Yi Bin
Haoxuan Li
Yahui Xu
Xing Xu
Yang Yang
Heng Tao Shen
VOS
133
28
0
08 Aug 2023
Tiny LVLM-eHub: Early Multimodal Experiments with Bard
IEEE Transactions on Big Data (IEEE Trans. Big Data), 2023
Wenqi Shao
Yutao Hu
Shiyang Feng
Meng Lei
Kaipeng Zhang
...
Peng Xu
Siyuan Huang
Jiaming Song
Yuning Qiao
Ping Luo
VLM, MLLM
187
24
0
07 Aug 2023
EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding
European Conference on Computer Vision (ECCV), 2023
Jiazhou Zhou
Xueye Zheng
Yuanhuiyi Lyu
Lin Wang
VLM
305
25
0
06 Aug 2023
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
International Conference on Machine Learning (ICML), 2023
Weihao Yu
Zhengyuan Yang
Linjie Li
Jianfeng Wang
Kevin Qinghong Lin
Zicheng Liu
Xinchao Wang
Lijuan Wang
MLLM
378
989
0
04 Aug 2023
Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP
Neural Information Processing Systems (NeurIPS), 2023
Qihang Yu
Ju He
XueQing Deng
Xiaohui Shen
Liang-Chieh Chen
VLM, CLIP
225
197
0
04 Aug 2023
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
International Conference on Learning Representations (ICLR), 2023
Weiyun Wang
Min Shi
Qingyun Li
Wen Wang
Zhenhang Huang
...
Zhiguo Cao
Yushi Chen
Tong Lu
Jifeng Dai
Yu Qiao
LRM, MLLM
217
115
0
03 Aug 2023
Reverse Stable Diffusion: What prompt was used to generate this image?
Computer Vision and Image Understanding (CVIU), 2023
Florinel-Alin Croitoru
Vlad Hondru
Radu Tudor Ionescu
M. Shah
VLM, DiffM
220
9
0
02 Aug 2023
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Anas Awadalla
Irena Gao
Josh Gardner
Jack Hessel
Yusuf Hanafy
...
Simon Kornblith
Pang Wei Koh
Gabriel Ilharco
Mitchell Wortsman
Ludwig Schmidt
MLLM
308
523
0
02 Aug 2023
Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model
ACM Multimedia (ACM MM), 2023
Ka Leong Cheng
Wenpo Song
Zheng Ma
Wenhao Zhu
Zi-Yue Zhu
Jianbing Zhang
CLIP, VLM
130
18
0
02 Aug 2023
Transferable Decoding with Visual Entities for Zero-Shot Image Captioning
IEEE International Conference on Computer Vision (ICCV), 2023
Junjie Fei
Teng Wang
Jinrui Zhang
Zhenyu He
Chengjie Wang
Feng Zheng
VLM
141
60
0
31 Jul 2023
Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation
ACM Multimedia Asia (MA), 2023
Zhiyuan Li
Dongnan Liu
Heng Wang
Chaoyi Zhang
Weidong (Tom) Cai
RALM
122
1
0
27 Jul 2023
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures
Kun Yuan
V. Srivastav
Tong Yu
Joël L. Lavanchy
J. Marescaux
Pietro Mascagni
Nassir Navab
N. Padoy
559
44
0
27 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming-Hsuan Yang
Fahad Shahbaz Khan
VLM
380
149
0
25 Jul 2023
Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework
Jingxuan Wei
Cheng Tan
Zhangyang Gao
Linzhuang Sun
Siyuan Li
Bihui Yu
R. Guo
Stan Z. Li
LRM
295
16
0
24 Jul 2023
Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering
Knowledge Discovery and Data Mining (KDD), 2023
Xinyue Hu
Lin Gu
Qi A. An
Mengliang Zhang
Liangchen Liu
Kazuma Kobayashi
Tatsuya Harada
Ronald M. Summers
Yingying Zhu
MedIm
177
49
0
22 Jul 2023
OxfordTVG-HIC: Can Machine Make Humorous Captions from Images?
IEEE International Conference on Computer Vision (ICCV), 2023
Runjia Li
Shuyang Sun
Mohamed Elhoseiny
Juil Sock
163
16
0
21 Jul 2023
Divide & Bind Your Attention for Improved Generative Semantic Nursing
British Machine Vision Conference (BMVC), 2023
Yumeng Li
Margret Keuper
Dan Zhang
Anna Khoreva
DiffM
237
74
0
20 Jul 2023
Reference-based Painterly Inpainting via Diffusion: Crossing the Wild Reference Domain Gap
Dejia Xu
Xingqian Xu
Wenyan Cong
Humphrey Shi
Zinan Lin
DiffM
132
4
0
20 Jul 2023
Improving Multimodal Datasets with Image Captioning
Neural Information Processing Systems (NeurIPS), 2023
Thao Nguyen
S. Gadre
Gabriel Ilharco
Sewoong Oh
Ludwig Schmidt
VLM
187
121
0
19 Jul 2023
Embedded Heterogeneous Attention Transformer for Cross-lingual Image Captioning
IEEE Transactions on Multimedia (IEEE TMM), 2023
Zijie Song
Zhenzhen Hu
Yuanen Zhou
Ye Zhao
Richang Hong
Meng Wang
139
18
0
19 Jul 2023
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Chaoyang Zhu
Long Chen
ObjD, VLM
407
63
0
18 Jul 2023
SINC: Self-Supervised In-Context Learning for Vision-Language Tasks
IEEE International Conference on Computer Vision (ICCV), 2023
Yi-Syuan Chen
Yun-Zhu Song
Cheng Yu Yeo
Bei Liu
Jianlong Fu
Hong-Han Shuai
VLM, LRM
195
7
0
15 Jul 2023
Gloss Attention for Gloss-free Sign Language Translation
Computer Vision and Pattern Recognition (CVPR), 2023
Aoxiong Yin
Tianyun Zhong
Lilian H. Y. Tang
Weike Jin
Tao Jin
Zhou Zhao
SLR
171
60
0
14 Jul 2023
MMBench: Is Your Multi-modal Model an All-around Player?
European Conference on Computer Vision (ECCV), 2023
Yuanzhan Liu
Haodong Duan
Yuanhan Zhang
Yue Liu
Songyang Zhang
...
Yuan Liu
Conghui He
Ziwei Liu
Kai-xiang Chen
Dahua Lin
480
1,578
0
12 Jul 2023
Emu: Generative Pretraining in Multimodality
International Conference on Learning Representations (ICLR), 2023
Quan-Sen Sun
Qiying Yu
Yufeng Cui
Fan Zhang
Xiaosong Zhang
Yueze Wang
Hongcheng Gao
Jingjing Liu
Tiejun Huang
Xinlong Wang
MLLM
281
154
0
11 Jul 2023
Semantic-SAM: Segment and Recognize Anything at Any Granularity
Feng Li
Hao Zhang
Pei Sun
Xueyan Zou
Siyi Liu
Jianwei Yang
Chun-yue Li
Lei Zhang
Jianfeng Gao
VLM
221
215
0
10 Jul 2023
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Shilong Zhang
Pei Sun
Shoufa Chen
Min Xiao
Wenqi Shao
Wenwei Zhang
Yu Liu
Kai-xiang Chen
Ping Luo
MLLM, VLM
689
304
0
07 Jul 2023
Vision Language Transformers: A Survey
Clayton Fields
C. Kennington
VLM
138
6
0
06 Jul 2023
T-MARS: Improving Visual Representations by Circumventing Text Feature Learning
International Conference on Learning Representations (ICLR), 2023
Pratyush Maini
Sachin Goyal
Zachary Chase Lipton
J. Zico Kolter
Aditi Raghunathan
VLM
127
40
0
06 Jul 2023
On the Cultural Gap in Text-to-Image Generation
European Conference on Artificial Intelligence (ECAI), 2023
Bingshuai Liu
Longyue Wang
Chenyang Lyu
Yong Zhang
Jinsong Su
Shuming Shi
Zhaopeng Tu
VLM, EGVM
115
12
0
06 Jul 2023
Several categories of Large Language Models (LLMs): A Short Survey
International Journal for Research in Applied Science and Engineering Technology (IJRASET), 2023
Saurabh Pahune
Manoj Chandrasekharan
AILaw
142
26
0
05 Jul 2023
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Yan Zeng
Hanbo Zhang
Jiani Zheng
Jiangnan Xia
Guoqiang Wei
Yang Wei
Yuchen Zhang
Tao Kong
MLLM
230
88
0
05 Jul 2023
Multimodal Prompt Learning for Product Title Generation with Extremely Limited Labels
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Bang-ju Yang
Fenglin Liu
Zheng Li
Qingyu Yin
Chenyu You
Bing Yin
Yuexian Zou
VLM
172
5
0
05 Jul 2023
Visual Instruction Tuning with Polite Flamingo
AAAI Conference on Artificial Intelligence (AAAI), 2023
Delong Chen
Jianfeng Liu
Wenliang Dai
Baoyuan Wang
MLLM
324
52
0
03 Jul 2023
JourneyDB: A Benchmark for Generative Image Understanding
Neural Information Processing Systems (NeurIPS), 2023
Keqiang Sun
Junting Pan
Yuying Ge
Hao Li
Haodong Duan
...
Yi Wang
Jifeng Dai
Yu Qiao
Limin Wang
Jiaming Song
280
160
0
03 Jul 2023
UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Rui Sun
Zhecan Wang
Haoxuan You
Noel Codella
Kai-Wei Chang
Shih-Fu Chang
CLIP
263
4
0
03 Jul 2023
A Massive Scale Semantic Similarity Dataset of Historical English
Neural Information Processing Systems (NeurIPS), 2023
Emily Silcock
Melissa Dell
159
5
0
30 Jun 2023
CLIPAG: Towards Generator-Free Text-to-Image Generation
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Roy Ganz
Michael Elad
VLM
178
14
0
29 Jun 2023
Towards Open Vocabulary Learning: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Jianzong Wu
Xiangtai Li
Shilin Xu
Haobo Yuan
Henghui Ding
...
Jiangning Zhang
Yu Tong
Xudong Jiang
Guohao Li
Dacheng Tao
ObjD, VLM
326
210
0
28 Jun 2023
Semi-supervised Multimodal Representation Learning through a Global Workspace
IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2023
Benjamin Devillers
Léopold Maytié
R. V. Rullen
SSL
118
10
0
27 Jun 2023
Approximated Prompt Tuning for Vision-Language Pre-trained Models
Qiong Wu
Shubin Huang
Weihao Ye
Pingyang Dai
Annan Shu
Guannan Jiang
Rongrong Ji
VLM, VPVLM
107
2
0
27 Jun 2023
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Ke Chen
Zhao Zhang
Weili Zeng
Richong Zhang
Feng Zhu
Rui Zhao
ObjD
374
794
0
27 Jun 2023
Improving Reference-based Distinctive Image Captioning with Contrastive Rewards
Yangjun Mao
Jun Xiao
Dong Zhang
Meng Cao
Jian Shao
Yueting Zhuang
Long Chen
EGVM
140
9
0
25 Jun 2023
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu
Peixian Chen
Chunjiang Ge
Yulei Qin
Mengdan Zhang
...
Xing Sun
Zhenyu Qiu
Rongrong Ji
Caifeng Shan
Ran He
ELM, MLLM
645
1,170
0
23 Jun 2023