Microsoft COCO Captions: Data Collection and Evaluation Server


1 April 2015
Xinlei Chen
Hao Fang
Nayeon Lee
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick

Papers citing "Microsoft COCO Captions: Data Collection and Evaluation Server"

50 / 1,387 papers shown
SINC: Self-Supervised In-Context Learning for Vision-Language Tasks
Yi-Syuan Chen
Yun-Zhu Song
Cheng Yu Yeo
Bei Liu
Jianlong Fu
Hong-Han Shuai
VLM
LRM
26
4
0
15 Jul 2023
Gloss Attention for Gloss-free Sign Language Translation
Aoxiong Yin
Tianyun Zhong
Lilian H. Y. Tang
Weike Jin
Tao Jin
Zhou Zhao
SLR
18
37
0
14 Jul 2023
MMBench: Is Your Multi-modal Model an All-around Player?
Yuanzhan Liu
Haodong Duan
Yuanhan Zhang
Bo-wen Li
Songyang Zhang
...
Jiaqi Wang
Conghui He
Ziwei Liu
Kai-xiang Chen
Dahua Lin
29
907
0
12 Jul 2023
Emu: Generative Pretraining in Multimodality
Quan-Sen Sun
Qiying Yu
Yufeng Cui
Fan Zhang
Xiaosong Zhang
Yueze Wang
Hongcheng Gao
Jingjing Liu
Tiejun Huang
Xinlong Wang
MLLM
37
126
0
11 Jul 2023
Semantic-SAM: Segment and Recognize Anything at Any Granularity
Feng Li
Hao Zhang
Pei Sun
Xueyan Zou
Siyi Liu
Jianwei Yang
Chun-yue Li
Lei Zhang
Jianfeng Gao
VLM
37
173
0
10 Jul 2023
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Shilong Zhang
Pei Sun
Shoufa Chen
Min Xiao
Wenqi Shao
Wenwei Zhang
Yu Liu
Kai-xiang Chen
Ping Luo
VLM
MLLM
85
224
0
07 Jul 2023
Vision Language Transformers: A Survey
Clayton Fields
C. Kennington
VLM
28
5
0
06 Jul 2023
T-MARS: Improving Visual Representations by Circumventing Text Feature Learning
Pratyush Maini
Sachin Goyal
Zachary Chase Lipton
J. Zico Kolter
Aditi Raghunathan
VLM
42
33
0
06 Jul 2023
On the Cultural Gap in Text-to-Image Generation
Bingshuai Liu
Longyue Wang
Chenyang Lyu
Yong Zhang
Jinsong Su
Shuming Shi
Zhaopeng Tu
VLM
EGVM
33
6
0
06 Jul 2023
Several categories of Large Language Models (LLMs): A Short Survey
Saurabh Pahune
Manoj Chandrasekharan
AILaw
25
14
0
05 Jul 2023
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
Yan Zeng
Hanbo Zhang
Jiani Zheng
Jiangnan Xia
Guoqiang Wei
Yang Wei
Yuchen Zhang
Tao Kong
MLLM
27
71
0
05 Jul 2023
Multimodal Prompt Learning for Product Title Generation with Extremely Limited Labels
Bang-ju Yang
Fenglin Liu
Zheng Li
Qingyu Yin
Chenyu You
Bing Yin
Yuexian Zou
VLM
33
5
0
05 Jul 2023
Visual Instruction Tuning with Polite Flamingo
Delong Chen
Jianfeng Liu
Wenliang Dai
Baoyuan Wang
MLLM
34
42
0
03 Jul 2023
JourneyDB: A Benchmark for Generative Image Understanding
Keqiang Sun
Junting Pan
Yuying Ge
Hao Li
Haodong Duan
...
Yi Wang
Jifeng Dai
Yu Qiao
Limin Wang
Hongsheng Li
54
102
0
03 Jul 2023
UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding
Rui Sun
Zhecan Wang
Haoxuan You
Noel Codella
Kai-Wei Chang
Shih-Fu Chang
CLIP
30
3
0
03 Jul 2023
A Massive Scale Semantic Similarity Dataset of Historical English
Emily Silcock
Melissa Dell
39
5
0
30 Jun 2023
CLIPAG: Towards Generator-Free Text-to-Image Generation
Roy Ganz
Michael Elad
VLM
28
7
0
29 Jun 2023
Towards Open Vocabulary Learning: A Survey
Jianzong Wu
Xiangtai Li
Shilin Xu
Haobo Yuan
Henghui Ding
...
Jiangning Zhang
Yu Tong
Xudong Jiang
Guohao Li
Dacheng Tao
ObjD
VLM
34
136
0
28 Jun 2023
Semi-supervised Multimodal Representation Learning through a Global Workspace
Benjamin Devillers
Léopold Maytié
R. V. Rullen
SSL
24
5
0
27 Jun 2023
Approximated Prompt Tuning for Vision-Language Pre-trained Models
Qiong Wu
Shubin Huang
Yiyi Zhou
Pingyang Dai
Annan Shu
Guannan Jiang
Rongrong Ji
VLM
VPVLM
25
2
0
27 Jun 2023
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Ke Chen
Zhao Zhang
Weili Zeng
Richong Zhang
Feng Zhu
Rui Zhao
ObjD
42
598
0
27 Jun 2023
Improving Reference-based Distinctive Image Captioning with Contrastive Rewards
Yangjun Mao
Jun Xiao
Dong Zhang
Meng Cao
Jian Shao
Yueting Zhuang
Long Chen
EGVM
29
9
0
25 Jun 2023
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu
Peixian Chen
Yunhang Shen
Yulei Qin
Mengdan Zhang
...
Xiawu Zheng
Ke Li
Xing Sun
Zhenyu Qiu
Rongrong Ji
ELM
MLLM
42
766
0
23 Jun 2023
Exploring the Role of Audio in Video Captioning
Yuhan Shen
Linjie Yang
Longyin Wen
Haichao Yu
Ehsan Elhamifar
Heng Wang
18
2
0
21 Jun 2023
VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution
S. Hall
F. G. Abrantes
Hanwen Zhu
Grace A. Sodunke
Aleksandar Shtedritski
Hannah Rose Kirk
CoGe
21
39
0
21 Jun 2023
Dense Video Object Captioning from Disjoint Supervision
Xingyi Zhou
Anurag Arnab
Chen Sun
Cordelia Schmid
31
3
0
20 Jun 2023
How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese
T. Fujii
Koki Shibata
Atsuki Yamaguchi
Terufumi Morishita
Yasuhiro Sogawa
18
13
0
16 Jun 2023
Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis
Xiaoshi Wu
Yiming Hao
Keqiang Sun
Yixiong Chen
Feng Zhu
Rui Zhao
Hongsheng Li
46
252
0
15 Jun 2023
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models
Peng-Tao Xu
Wenqi Shao
Kaipeng Zhang
Peng Gao
Shuo Liu
Meng Lei
Fanqing Meng
Siyuan Huang
Yu Qiao
Ping Luo
ELM
MLLM
33
159
0
15 Jun 2023
Pragmatic Inference with a CLIP Listener for Contrastive Captioning
Jiefu Ou
Benno Krojer
Daniel Fried
21
5
0
15 Jun 2023
Image Captioners Are Scalable Vision Learners Too
Michael Tschannen
Manoj Kumar
Andreas Steiner
Xiaohua Zhai
N. Houlsby
Lucas Beyer
VLM
CLIP
23
53
0
13 Jun 2023
Weakly Supervised Visual Question Answer Generation
Charani Alampalle
Shamanthak Hegde
Soumya Jahagirdar
Shankar Gangisetty
9
0
0
11 Jun 2023
Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark
Li Xu
Bo Liu
Ameer Hamza Khan
Lu Fan
Xiao-Ming Wu
LM&MA
27
9
0
10 Jun 2023
MIMIC-IT: Multi-Modal In-Context Instruction Tuning
Bo-wen Li
Yuanhan Zhang
Liangyu Chen
Jinghao Wang
Fanyi Pu
Jingkang Yang
C. Li
Ziwei Liu
MLLM
VLM
37
224
0
08 Jun 2023
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
Wenxuan Zhang
Sharifah Mahani Aljunied
Chang Gao
Yew Ken Chia
Lidong Bing
ELM
26
81
0
08 Jun 2023
Table and Image Generation for Investigating Knowledge of Entities in Pre-trained Vision and Language Models
Hidetaka Kamigaito
Katsuhiko Hayashi
Taro Watanabe
VLM
15
1
0
03 Jun 2023
Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models
Shuo Chen
Jindong Gu
Zhen Han
Yunpu Ma
Philip H. S. Torr
Volker Tresp
VPVLM
VLM
34
17
0
03 Jun 2023
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Ji Lin
Jiaming Tang
Haotian Tang
Shang Yang
Wei-Ming Chen
Wei-Chen Wang
Guangxuan Xiao
Xingyu Dang
Chuang Gan
Song Han
EDL
MQ
36
470
0
01 Jun 2023
Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective
Yingying Fan
Yu Wu
Bo Du
Yutian Lin
34
8
0
01 Jun 2023
Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting
Shubin Huang
Qiong Wu
Yiyi Zhou
Weijie Chen
Rongsheng Zhang
Xiaoshuai Sun
Rongrong Ji
VLM
VPVLM
LRM
16
0
0
01 Jun 2023
GPT4Image: Large Pre-trained Models Help Vision Models Learn Better on Perception Task
Ning Ding
Yehui Tang
Zhongqian Fu
Chaoting Xu
Kai Han
Yunhe Wang
MLLM
VLM
37
2
0
01 Jun 2023
ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
Xiao Xu
Bei Li
Chenfei Wu
Shao-Yen Tseng
Anahita Bhiwandiwalla
Shachar Rosenman
Vasudev Lal
Wanxiang Che
Nan Duan
AIFin
VLM
37
2
0
31 May 2023
LMCap: Few-shot Multilingual Image Captioning by Retrieval Augmented Language Model Prompting
R. Ramos
Bruno Martins
Desmond Elliott
VLM
13
16
0
31 May 2023
Translation-Enhanced Multilingual Text-to-Image Generation
Yaoyiran Li
Ching-Yun Chang
Stephen Rawls
Ivan Vulić
Anna Korhonen
21
8
0
30 May 2023
Enhanced Chart Understanding in Vision and Language Task via Cross-modal Pre-training on Plot Table Pairs
Mingyang Zhou
Yi Ren Fung
Long Chen
Christopher Thomas
Heng Ji
Shih-Fu Chang
23
11
0
29 May 2023
Improved Probabilistic Image-Text Representations
Sanghyuk Chun
VLM
33
26
0
29 May 2023
Z-GMOT: Zero-shot Generic Multiple Object Tracking
Kim Hoang Tran
Anh Duy Le Dinh
Tien-Phat Nguyen
Thinh Phan
Pha Nguyen
Khoa Luu
Don Adjeroh
Gianfranco Doretto
Ngan Hoang Le
VOT
33
5
0
28 May 2023
Learning from Children: Improving Image-Caption Pretraining via Curriculum
Hammad A. Ayyubi
R. Lokesh
Alireza Zareian
Bohong Wu
Shih-Fu Chang
VLM
CLIP
22
1
0
27 May 2023
CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers
Dachuan Shi
Chaofan Tao
Anyi Rao
Zhendong Yang
Chun Yuan
Jiaqi Wang
VLM
30
22
0
27 May 2023
Three Towers: Flexible Contrastive Learning with Pretrained Image Models
Jannik Kossen
Mark Collier
Basil Mustafa
Tianlin Li
Xiaohua Zhai
Lucas Beyer
Andreas Steiner
Jesse Berent
Rodolphe Jenatton
Efi Kokiopoulou
VLM
42
11
0
26 May 2023