Microsoft COCO Captions: Data Collection and Evaluation Server

1 April 2015
Xinlei Chen
Hao Fang
Nayeon Lee
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick

Papers citing "Microsoft COCO Captions: Data Collection and Evaluation Server"

50 / 1,519 papers shown
TRAP: Targeted Redirecting of Agentic Preferences
Hangoo Kang
Jehyeok Yeon
Gagandeep Singh
AAML
307
2
0
29 May 2025
Towards Minimizing Feature Drift in Model Merging: Layer-wise Task Vector Fusion for Adaptive Knowledge Integration
Wenju Sun
Qingyong Li
Wen Wang
Yang Liu
Yangli-ao Geng
Boyang Li
MoMe
312
3
0
29 May 2025
Chain-of-Talkers (CoTalk): Fast Human Annotation of Dense Image Captions
Yijun Shen
Delong Chen
Fan Liu
Xingyu Wang
Chuanyi Zhang
Liang Yao
Yuhui Zheng
295
1
0
28 May 2025
Vision Transformers with Self-Distilled Registers
Yinjie Chen
Zipeng Yan
Chong Zhou
Bo Dai
Andrew F. Luo
473
4
0
27 May 2025
Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models
Yufei Zhan
Hongyin Zhao
Yousong Zhu
Shurong Zheng
Fan Yang
Ming Tang
Jinqiao Wang
VLM, LRM
282
1
0
27 May 2025
Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
Minheng Ni
Zhengyuan Yang
Linjie Li
Chung-Ching Lin
Kevin Qinghong Lin
W. Zuo
Lijuan Wang
ReLM, LRM
306
12
0
26 May 2025
From Data to Modeling: Fully Open-vocabulary Scene Graph Generation
Zuyao Chen
Jinlin Wu
Zhen Lei
Chang Wen Chen
183
1
0
26 May 2025
TNG-CLIP:Training-Time Negation Data Generation for Negation Awareness of CLIP
Yuliang Cai
Jesse Thomason
Mohammad Rostami
VLM
239
0
0
24 May 2025
Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion
Jacob A. Hansen
Wei Lin
Junmo Kang
M. Jehanzeb Mirza
Hongyin Luo
Rogerio Feris
Alan Ritter
James R. Glass
Leonid Karlinsky
VLM
447
1
0
23 May 2025
Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM
Donghwan Chi
Hyomin Kim
Yoonjin Oh
Yongjin Kim
Donghoon Lee
DaeJin Jo
Jongmin Kim
Junyeob Baek
Sungjin Ahn
Sungwoong Kim
MLLM, VLM
923
0
0
23 May 2025
Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives
IEEE Geoscience and Remote Sensing Magazine (GRSM), 2025
Xingxing Weng
Chao Pang
Gui-Song Xia
VLM
398
12
0
20 May 2025
Uniformity First: Uniformity-aware Test-time Adaptation of Vision-language Models against Image Corruption
Kazuki Adachi
Shin'ya Yamaguchi
Tomoki Hamagami
VLM
302
1
0
19 May 2025
Visuospatial Cognitive Assistant
Qi Feng
LM&Ro
347
2
0
18 May 2025
Hyperspectral Image Land Cover Captioning Dataset for Vision Language Models
Aryan Das
Tanishq Rachamalla
Pravendra Singh
Koushik Biswas
Vinay Kumar Verma
Swalpa Kumar Roy
VLM
242
2
0
18 May 2025
Variational Visual Question Answering for Uncertainty-Aware Selective Prediction
Tobias Jan Wieczorek
Nathalie Daun
Mohammad Emtiyaz Khan
Marcus Rohrbach
OOD
570
0
0
14 May 2025
CAT Merging: A Training-Free Approach for Resolving Conflicts in Model Merging
Wenju Sun
Qingyong Li
Yangli-ao Geng
Boyang Li
MoMe
330
8
0
11 May 2025
DriveSOTIF: Advancing Perception SOTIF Through Multimodal Large Language Models
Shucheng Huang
Freda Shi
Chen Sun
Jiaming Zhong
Minghao Ning
Yufeng Yang
Yukun Lu
Hong Wang
A. Khajepour
463
0
0
11 May 2025
Emotion-Qwen: A Unified Framework for Emotion and Vision Understanding
Dawei Huang
Qing Li
Chuan Yan
Minghan Li
Jiaming Ji
...
Xiaobei Wang
X. Wang
Zheng Lian
Zhi-Qi Cheng
Xiaojiang Peng
316
1
0
10 May 2025
Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models
Conference on Uncertainty in Artificial Intelligence (UAI), 2025
Aishwarya Venkataramanan
P. Bodesheim
Joachim Denzler
BDL, VLM
418
3
0
08 May 2025
X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP
Hanxun Huang
Sarah Monazam Erfani
Yige Li
Jiabo He
James Bailey
AAML
486
9
0
08 May 2025
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
Haokun Lin
Teng Wang
Yixiao Ge
Yuying Ge
Zhichao Lu
Ying Wei
Gang Qu
Zhenan Sun
Mingyu Ding
MLLM, VLM
447
33
0
08 May 2025
Multi-Agent System for Comprehensive Soccer Understanding
Jiayuan Rao
Zhiyu Li
Haoning Wu
Yujiao Shi
Yanfeng Wang
Weidi Xie
LLMAG
387
7
0
06 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Wei Wei
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
...
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
1.2K
35
0
05 May 2025
Adaptive Token Boundaries: Integrating Human Chunking Mechanisms into Multimodal LLMs
Social Science Research Network (SSRN), 2025
Dongxing Yu
256
1
0
03 May 2025
Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders
International Conference on Text, Speech and Dialogue (TSD), 2025
Andrei-Alexandru Manea
Jindřich Libovický
VLM
394
1
0
30 Apr 2025
Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models
Minh-Hao Van
Xintao Wu
VLM
366
2
0
30 Apr 2025
Masked Language Prompting for Generative Data Augmentation in Few-shot Fashion Style Recognition
Yuki Hirakawa
Ryotaro Shimizu
282
0
0
28 Apr 2025
VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning
Run Luo
Renke Shan
Longze Chen
Ziqiang Liu
Lu Wang
Min Yang
Xiaobo Xia
MLLM, VLM
529
3
0
28 Apr 2025
Revisiting Data Auditing in Large Vision-Language Models
Hongyu Zhu
Sichu Liang
Wenjie Wang
Boheng Li
Tongxin Yuan
Fangqi Li
Shilin Wang
Zhuosheng Zhang
VLM
1.1K
3
0
25 Apr 2025
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs
Zehao Wang
Senthil Purushwalkam
Caiming Xiong
Siyang Song
Chenhui Xu
Ran Xu
408
6
0
23 Apr 2025
Decoupled Global-Local Alignment for Improving Compositional Understanding
Xiaoxing Hu
Kaicheng Yang
Chao Guo
Haoran Xu
Ziyong Feng
Longji Xu
VLM
709
8
0
23 Apr 2025
Structure-Preserving Zero-Shot Image Editing via Stage-Wise Latent Injection in Diffusion Models
Dasol Jeong
Donggoo Kang
Jiwon Park
Hyebean Lee
Joonki Paik
DiffM
375
0
0
22 Apr 2025
$π_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
Physical Intelligence
Kevin Black
Noah Brown
James Darpinian
Karan Dhabalia
...
Homer Walke
Anna Walling
Haohuan Wang
Lili Yu
Ury Zhilinsky
LM&Ro, VLM
8.2K
431
0
22 Apr 2025
A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling
Kyle Buettner
Jacob Emmerson
Adriana Kovashka
168
0
0
19 Apr 2025
Towards Explainable Fake Image Detection with Multi-Modal Large Language Models
Yikun Ji
Y. Hong
Jiahui Zhan
H. Chen
Jun Lan
Huijia Zhu
Weiqiang Wang
Guang Dai
Jianfu Zhang
MLLM, LRM
523
4
0
19 Apr 2025
Hydra: An Agentic Reasoning Approach for Enhancing Adversarial Robustness and Mitigating Hallucinations in Vision-Language Models
Chung-En
Hsuan-Chih Chen
Brian Jalaian
Nathaniel D. Bastian
AAML, VLM
296
1
0
19 Apr 2025
Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator
Akshat Ramachandran
Souvik Kundu
Arnab Raha
Shamik Kundu
Deepak K. Mathaikutty
Tushar Krishna
361
5
0
19 Apr 2025
Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation
Yongchao Feng
Yajie Liu
Shuai Yang
Wenrui Cai
Jing Zhang
...
Jiahui Lv
Ziqiang Liu
Tengyuan Shi
Qingjie Liu
Longji Xu
MLLM, VLM
322
11
0
13 Apr 2025
Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions
Tommaso Galliena
Tommaso Apicella
Stefano Rosa
Pietro Morerio
Alessio Del Bue
Lorenzo Natale
394
1
0
11 Apr 2025
Mimic In-Context Learning for Multimodal Tasks
Computer Vision and Pattern Recognition (CVPR), 2025
Yuchu Jiang
Jiale Fu
Chenduo Hao
Xinting Hu
Yingzhe Peng
Xin Geng
Xu Yang
348
9
0
11 Apr 2025
Perception in Reflection
Yana Wei
Liang Zhao
Kangheng Lin
En Yu
Yuang Peng
...
Jianjian Sun
Haoran Wei
Zheng Ge
Xiangyu Zhang
Vishal M. Patel
339
7
0
09 Apr 2025
V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models
Xiangxi Zheng
Linjie Li
Zhiyong Yang
Ping Yu
Alex Jinpeng Wang
Rui Yan
Yuan Yao
Lijuan Wang
LRM
363
3
0
08 Apr 2025
Spingarn's Method and Progressive Decoupling Beyond Elicitable Monotonicity
B. Evens
P. Latafat
Panagiotis Patrinos
427
0
0
01 Apr 2025
XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?
Computer Vision and Pattern Recognition (CVPR), 2025
Fengxiang Wang
Hongru Wang
Mingshuo Chen
Haiyan Zhao
Yulin Wang
...
L. Lan
Wenjing Yang
Jing Zhang
Zhiyuan Liu
Maosong Sun
338
24
0
31 Mar 2025
A Survey on Remote Sensing Foundation Models: From Vision to Multimodality
Ziyue Huang
Hongxi Yan
Qiqi Zhan
Shuai Yang
Mingming Zhang
Yiming Lei
Chenkai Zhang
Zeming Liu
Qingjie Liu
Longji Xu
384
11
0
28 Mar 2025
Evaluating Text-to-Image and Text-to-Video Synthesis with a Conditional Fréchet Distance
Jaywon Koo
J. Hernandez
Moayed Haji-Ali
Ziyan Yang
Vicente Ordonez
EGVM
339
0
0
27 Mar 2025
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
Size Wu
Feiyu Xiong
Lumin Xu
Sheng Jin
Zhonghua Wu
Qingyi Tao
Wentao Liu
Wei Li
Chen Change Loy
VGen
988
37
0
27 Mar 2025
Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy
Yinan Sun
Xiongkuo Min
Zicheng Zhang
Yixuan Gao
Yuhang Cao
Guoquan Zheng
VLM
306
2
0
26 Mar 2025
Unified Multimodal Discrete Diffusion
Alexander Swerdlow
Mihir Prabhudesai
Siddharth Gandhi
Deepak Pathak
Katerina Fragkiadaki
DiffM
345
24
0
26 Mar 2025
Gemma 3 Technical Report
Gemma Team
Aishwarya B Kamath
Johan Ferret
Shreya Pathak
Nino Vieillard
...
Harshal Tushar Lehri
Hussein Hazimeh
Ian Ballantyne
Idan Szpektor
Ivan Nardini
VLM
580
868
0
25 Mar 2025