Microsoft COCO Captions: Data Collection and Evaluation Server

1 April 2015
Xinlei Chen
Hao Fang
Nayeon Lee
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick

Papers citing "Microsoft COCO Captions: Data Collection and Evaluation Server"

50 / 1,519 papers shown
MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging
Noel C. F. Codella
Ying Jin
Shrey Jain
Yu Gu
Ho Hin Lee
...
Lei Li
Thomas Lin
Ivan Tarapov
M. Lungren
Mu-Hsin Wei
LM&MA, VLM, MedIm
315
32
0
09 Oct 2024
$M^3EL$: A Multi-task Multi-topic Dataset for Multi-modal Entity Linking
Fang Wang
Shenglin Yin
Xiaoying Bai
Minghao Hu
Tianwei Yan
Yi Liang
VLM
247
1
0
08 Oct 2024
SIA-OVD: Shape-Invariant Adapter for Bridging the Image-Region Gap in Open-Vocabulary Detection
ACM Multimedia (MM), 2024
Zishuo Wang
Wenhao Zhou
Jinglin Xu
Yuxin Peng
ObjD, VLM
208
7
0
08 Oct 2024
Precise Model Benchmarking with Only a Few Observations
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Riccardo Fogliato
Pratik Patil
Nil-Jana Akpinar
Mathew Monfort
209
1
0
07 Oct 2024
Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Youngtaek Oh
Jae-Won Cho
Dong-Jin Kim
In So Kweon
Junmo Kim
VLM, CoGe, CLIP
343
11
0
07 Oct 2024
MM-R$^3$: On (In-)Consistency of Vision-Language Models (VLMs)
Shih-Han Chou
Shivam Chandhok
James J. Little
Leonid Sigal
289
0
0
07 Oct 2024
VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning
International Conference on Learning Representations (ICLR), 2024
Han Lin
Tushar Nagarajan
Nicolas Ballas
Mido Assran
Mojtaba Komeili
Joey Tianyi Zhou
Koustuv Sinha
AI4TS
300
7
0
04 Oct 2024
Bridging the Gap between Text, Audio, Image, and Any Sequence: A Novel Approach using Gloss-based Annotation
Sen Fang
Sizhou Chen
Yalin Feng
Xiaofeng Zhang
T. Teoh
171
0
0
04 Oct 2024
Toward a Holistic Evaluation of Robustness in CLIP Models
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Weijie Tu
Weijian Deng
Tom Gedeon
VLM
349
7
0
02 Oct 2024
ASCIIEval: Benchmarking Models' Visual Perception in Text Strings via ASCII Art
Qi Jia
Xiang Yue
Shanshan Huang
Ziheng Qin
Yizhu Liu
Bill Yuchen Lin
Yang You
Guangtao Zhai
VLM
247
2
0
02 Oct 2024
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Haotian Zhang
Mingfei Gao
Zhe Gan
Philipp Dufter
Nina Wenzel
...
Haoxuan You
Zirui Wang
Afshin Dehghan
Peter Grasch
Yinfei Yang
VLM, MLLM
303
66
1
30 Sep 2024
Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval
ACM Multimedia (MM), 2024
Yabing Wang
Le Wang
Qiang-feng Zhou
Zhibin Wang
Hao Li
Gang Hua
Wei Tang
222
21
0
30 Sep 2024
Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats
Kuanrong Liu
Yaning Tan
Jiawei Liang
Pengwen Dai
Xiaochun Cao
MU, AAML
273
3
0
29 Sep 2024
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding
Heqing Zou
Tianze Luo
Guiyang Xie
Victor Zhang
...
Guangcong Wang
Juanyang Chen
Zhuochen Wang
Hansheng Zhang
Huaijian Zhang
VLM
299
19
0
27 Sep 2024
Emu3: Next-Token Prediction is All You Need
Xinlong Wang
Xiaosong Zhang
Zhengxiong Luo
Quan-Sen Sun
Yufeng Cui
...
Xi Yang
Jingjing Liu
Yonghua Lin
Tiejun Huang
Zhongyuan Wang
MLLM
290
483
0
27 Sep 2024
IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Soeun Lee
Si-Woo Kim
Taewhan Kim
Dong-Jin Kim
CLIP, VLM
217
6
0
26 Sep 2024
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Computer Vision and Pattern Recognition (CVPR), 2024
Matt Deitke
Christopher Clark
Sangho Lee
Rohun Tripathi
Yue Yang
...
Noah A. Smith
Hannaneh Hajishirzi
Ross Girshick
Ali Farhadi
Aniruddha Kembhavi
OSLM, VLM
457
58
0
25 Sep 2024
Understanding the Cognitive Complexity in Language Elicited by Product Images
Yan-Ying Chen
Shabnam Hakimi
Monica P Van
Francine Chen
Matthew K. Hong
M. Klenk
Charlene C. Wu
255
1
0
25 Sep 2024
Enhancing Advanced Visual Reasoning Ability of Large Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Zhiyuan Li
Dongnan Liu
Chaoyi Zhang
Heng Wang
Tengfei Xue
Weidong Cai
VLM, LRM
259
17
0
21 Sep 2024
Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model
Li Zhou
Xu Yuan
Zenghui Sun
Zikun Zhou
Jingsong Lan
VLM, MLLM
861
7
0
20 Sep 2024
JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images
Neural Information Processing Systems (NeurIPS), 2024
Zhecan Wang
Junzhang Liu
Chia-Wei Tang
Hani Alomari
Anushka Sivakumar
...
Haoxuan You
A. Ishmam
Kai-Wei Chang
Shih-Fu Chang
Chris Thomas
CoGe, VLM
505
5
0
19 Sep 2024
OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities
Bilal Faye
Hanane Azzag
M. Lebbah
ObjD
349
2
0
17 Sep 2024
Benchmarking VLMs' Reasoning About Persuasive Atypical Images
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Sina Malakouti
Aysan Aghazadeh
Ashmit Khandelwal
Adriana Kovashka
VLM
378
4
0
16 Sep 2024
Evaluating authenticity and quality of image captions via sentiment and semantic analyses
Aleksei Krotov
Alison Tebo
Dylan K. Picart
Aaron Dean Algave
128
1
0
14 Sep 2024
Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
Neelabh Sinha
Vinija Jain
Vasu Sharma
187
13
0
14 Sep 2024
Alignment of Diffusion Models: Fundamentals, Challenges, and Future
Buhua Liu
Shitong Shao
Bao Li
Lichen Bai
Zhiqiang Xu
Haoyi Xiong
James Kwok
Sumi Helal
Bo Han
463
22
0
11 Sep 2024
FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation
Xi Chen
Haosen Yang
Sheng Jin
Xiatian Zhu
Huanjin Yao
VLM
244
6
0
05 Sep 2024
A New People-Object Interaction Dataset and NVS Benchmarks
International Conference on Information Photonics (ICIP), 2024
Shuai Guo
Houqiang Zhong
Qi Wang
Ziyu Chen
Yijie Gao
Jiajing Yuan
Chenyu Zhang
Rong Xie
Li Song
268
1
0
03 Sep 2024
Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models
British Machine Vision Conference (BMVC), 2024
Bin Fu
Qiyang Wan
Jialin Li
Ruiping Wang
Xilin Chen
150
1
0
03 Sep 2024
Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning
Jaeyeon Kim
Jaeyoon Jung
Minjeong Jeon
Sang Hoon Woo
Jinjoo Lee
172
1
0
02 Sep 2024
Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data
Spencer Whitehead
Jacob Phillips
Sean Hendryx
183
0
0
30 Aug 2024
Image-Perfect Imperfections: Safety, Bias, and Authenticity in the Shadow of Text-To-Image Model Evolution
Conference on Computer and Communications Security (CCS), 2024
Yixin Wu
Yun Shen
Michael Backes
Yang Zhang
263
7
0
30 Aug 2024
A Survey on Evaluation of Multimodal Large Language Models
Jiaxing Huang
Jingyi Zhang
LM&MA, ELM, LRM
305
42
0
28 Aug 2024
Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach
Jiwei Guan
Tianyu Ding
Longbing Cao
Lei Pan
Chen Wang
Xi Zheng
AAML
287
3
0
24 Aug 2024
ParGo: Bridging Vision-Language with Partial and Global Views
AAAI Conference on Artificial Intelligence (AAAI), 2024
An-Lan Wang
Bin Shan
Wei Shi
Kun-Yu Lin
Xiang Fei
Guozhi Tang
Lei Liao
Jingqun Tang
Can Huang
Wei-Shi Zheng
MLLM, VLM
519
23
0
23 Aug 2024
SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs
Yuanyang Yin
Yaqi Zhao
Yajie Zhang
Yuanxing Zhang
Ke Lin
Jiahao Wang
Pengfei Wan
Di Zhang
Baoqun Yin
Wentao Zhang
LRM
332
11
0
21 Aug 2024
Attribution Analysis Meets Model Editing: Advancing Knowledge Correction in Vision Language Models with VisEdit
AAAI Conference on Artificial Intelligence (AAAI), 2024
Qizhou Chen
Taolin Zhang
Chengyu Wang
Xiaofeng He
Dakan Wang
Tingting Liu
KELM
695
5
0
19 Aug 2024
Quality Assessment in the Era of Large Models: A Survey
Zicheng Zhang
Yingjie Zhou
Chunyi Li
Baixuan Zhao
Xiaohong Liu
Guangtao Zhai
344
33
0
17 Aug 2024
Can Large Language Models Understand Symbolic Graphics Programs?
International Conference on Learning Representations (ICLR), 2024
Zeju Qiu
Weiyang Liu
Haiwen Feng
Zhen Liu
Tim Z. Xiao
Katherine M. Collins
J. Tenenbaum
Adrian Weller
Michael J. Black
Bernhard Schölkopf
602
28
0
15 Aug 2024
Efficient and Versatile Robust Fine-Tuning of Zero-shot Models
European Conference on Computer Vision (ECCV), 2024
Sungyeon Kim
Boseung Jeong
Donghyun Kim
Suha Kwak
VLM
232
9
0
11 Aug 2024
ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling
European Conference on Computer Vision (ECCV), 2024
William Y. Zhu
Keren Ye
Junjie Ke
Jiahui Yu
Leonidas Guibas
P. Milanfar
Feng Yang
341
2
0
07 Aug 2024
Attacks and Defenses for Generative Diffusion Models: A Comprehensive Survey
ACM Computing Surveys (ACM CSUR), 2024
V. T. Truong
Luan Ba Dang
Long Bao Le
DiffM, MedIm
341
45
0
06 Aug 2024
GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths
European Conference on Computer Vision (ECCV), 2024
Xianyu Chen
Ming Jiang
Qi Zhao
213
8
0
05 Aug 2024
VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Zihan Li
Diping Song
Zefeng Yang
Deming Wang
Fei Li
Xiulan Zhang
P. E. Kinahan
Yu Qiao
VLM, LM&MA
329
20
0
05 Aug 2024
VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks
Juhwan Choi
Junehyoung Kwon
Jungmin Yun
Seunguk Yu
Youngbin Kim
309
3
0
29 Jul 2024
Start from Video-Music Retrieval: An Inter-Intra Modal Loss for Cross Modal Retrieval
Zeyu Chen
Pengfei Zhang
Kai Ye
Wei Dong
Xin Feng
Yana Zhang
225
1
0
28 Jul 2024
LLAVADI: What Matters For Multimodal Large Language Models Distillation
Shilin Xu
Xiangtai Li
Haobo Yuan
Lu Qi
Yunhai Tong
Ming-Hsuan Yang
216
15
0
28 Jul 2024
SWIFT: Semantic Watermarking for Image Forgery Thwarting
Gautier Evennou
Vivien Chappelier
Ewa Kijak
Teddy Furon
245
6
0
26 Jul 2024
MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs
Jihyung Kil
Zheda Mai
Justin Lee
Zihe Wang
Kerrie Cheng
Jingyan Bai
Ye Liu
A. Chowdhury
Wei-Lun Chao
CoGe, VLM
345
1
0
23 Jul 2024
Multimodal Unlearnable Examples: Protecting Data against Multimodal Contrastive Learning
Xinwei Liu
Yang Liu
Yuan Xun
Yaning Tan
Simeng Qin
283
13
0
23 Jul 2024