1504.00325
Cited By
v1
v2 (latest)
Microsoft COCO Captions: Data Collection and Evaluation Server
1 April 2015
Xinlei Chen
Hao Fang
Nayeon Lee
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick
Papers citing
"Microsoft COCO Captions: Data Collection and Evaluation Server"
50 / 1,518 papers shown
Title
Beyond Generation: Multi-Hop Reasoning for Factual Accuracy in Vision-Language Models
Shamima Hossain
LRM
136
0
0
25 Nov 2025
SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge
Adeel Yousaf
Joseph Fioresi
James Beetham
Amrit Singh Bedi
Mubarak Shah
VLM
140
0
0
20 Nov 2025
PairHuman: A High-Fidelity Photographic Dataset for Customized Dual-Person Generation
Information Fusion (Inf. Fusion), 2025
Ting Pan
Ye Wang
Peiguang Jing
Rui Ma
Zili Yi
Y. Liu
221
0
0
20 Nov 2025
Zero-Training Task-Specific Model Synthesis for Few-Shot Medical Image Classification
Yao Qin
Yangyang Yan
YuanChao Yang
Jinhua Pang
Huanyong Bi
Yuan Liu
HaiHua Wang
MedIm
120
0
0
18 Nov 2025
NeuCLIP: Efficient Large-Scale CLIP Training with Neural Normalizer Optimization
Xiyuan Wei
Chih-Jen Lin
Tianbao Yang
VLM
116
0
0
11 Nov 2025
Interaction-Centric Knowledge Infusion and Transfer for Open-Vocabulary Scene Graph Generation
Lin Li
Chuhan Zhang
Dong Zhang
Chong Sun
Chen Li
L. Chen
120
0
0
08 Nov 2025
Surprisal reveals diversity gaps in image captioning and different scorers change the story
N. Ilinykh
Simon Dobnik
47
0
0
06 Nov 2025
Efficient Test-Time Retrieval Augmented Generation
Hailong Yin
B. Zhu
Yue Yu
Chong-Wah Ngo
RALM
3DV
185
0
0
02 Nov 2025
From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration
Jianwen Sun
Fanrui Zhang
Yukang Feng
Chuanhao Li
Zizhen Li
Jiaxin Ai
Yifan Chang
Yu Dai
Kaipeng Zhang
89
0
0
31 Oct 2025
Self-Improving Vision-Language-Action Models with Data Generation via Residual RL
Wenli Xiao
Haotian Lin
Andy Peng
Haoru Xue
Tairan He
...
Jimmy Wu
Zhengyi Luo
Linxi Fan
Guanya Shi
Yuke Zhu
VLM
478
4
0
30 Oct 2025
GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation
Karim Elmaaroufi
Liheng Lai
Justin Svegliato
Yutong Bai
Sanjit A. Seshia
Matei A. Zaharia
186
0
0
25 Oct 2025
KBE-DME: Dynamic Multimodal Evaluation via Knowledge Enhanced Benchmark Evolution
Junzhe Zhang
Huixuan Zhang
Xiaojun Wan
53
0
0
24 Oct 2025
Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context
Ge Zheng
Jiaye Qian
Jiajin Tang
Sibei Yang
94
2
0
23 Oct 2025
StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback
Jiho Park
Sieun Choi
Jaeyoon Seo
Jihie Kim
DiffM
117
0
0
23 Oct 2025
Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents
Yiqi Lin
Alex Jinpeng Wang
Linjie Li
Zhengyuan Yang
Mike Zheng Shou
124
0
0
21 Oct 2025
How Universal Are SAM2 Features?
Masoud Khairi Atani
Alon Harell
Hyomin Choi
Runyu Yang
Fabien Racapé
Ivan V. Bajić
VLM
116
0
0
19 Oct 2025
RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba
Kunyu Peng
Di Wen
Jia Fu
Jiamin Wu
Kailun Yang
...
Yufan Chen
Yuqian Fu
D. Paudel
Luc Van Gool
Rainer Stiefelhagen
113
0
0
18 Oct 2025
Theoretical Refinement of CLIP by Utilizing Linear Structure of Optimal Similarity
Naoki Yoshida
Satoshi Hayakawa
Yuhta Takida
Toshimitsu Uesaka
Hiromi Wakaki
Yuki Mitsufuji
120
0
0
17 Oct 2025
MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos
Gabriel Fiastre
Antoine Yang
Cordelia Schmid
VOS
401
0
0
16 Oct 2025
OS-HGAdapter: Open Semantic Hypergraph Adapter for Large Language Models Assisted Entropy-Enhanced Image-Text Alignment
Rongjun Chen
Chengsi Yao
Jinchang Ren
Xianxian Zeng
Peixian Wang
Jun Yuan
Jiawen Li
Huimin Zhao
Xu Lu
VLM
121
0
0
15 Oct 2025
Evolution of Meta's LLaMA models and parameter-efficient fine-tuning of large language models: a survey
Abdulhady Abas Abdullah
Arkaitz Zubiaga
Seyedali Mirjalili
Amir Gandomi
Fatemeh Daneshfar
Mohammadsadra Amini
Alan Salam Mohammed
Hadi Veisi
ALM
180
0
0
14 Oct 2025
Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos
Computer Vision and Pattern Recognition (CVPR), 2023
Rohit Gupta
Anirban Roy
Claire Christensen
Sujeong Kim
Sarah Gerard
Madeline Cincebeaux
Ajay Divakaran
Todd Grindal
M. Shah
140
21
0
13 Oct 2025
BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices
Euhid Aman
Esteban Carlin
Hsing-Kuo Pao
Giovanni Beltrame
Ghaluh Indah Permata Sari
Yie-Tarng Chen
108
0
0
12 Oct 2025
Vision Language Models: A Survey of 26K Papers
Fengming Lin
3DV
VLM
107
0
0
10 Oct 2025
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
Changyao Tian
Hao Li
Gen Luo
Xizhou Zhu
Weijie Su
...
Y. Liu
Lewei Lu
Wenhai Wang
Hongsheng Li
Jifeng Dai
121
1
0
09 Oct 2025
Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications
IEEE Access (IEEE Access), 2025
Kento Kawaharazuka
Jihoon Oh
Jun Yamada
Ingmar Posner
Yuke Zhu
LM&Ro
227
23
0
08 Oct 2025
Automated Repeatable Adversary Threat Emulation with Effects Language (EL)
Suresh Damodaran
Paul D. Rowe
AAML
128
8
0
07 Oct 2025
Uncertainty in Machine Learning
Hans Weytjens
Wouter Verbeke
UD
225
0
0
07 Oct 2025
Visual Representations inside the Language Model
Benlin Liu
Amita Kamath
Madeleine Grunde-McLaughlin
Winson Han
Ranjay Krishna
134
2
0
06 Oct 2025
Activation Steering with a Feedback Controller
Dung V. Nguyen
Hieu M. Vu
Nhi Y. Pham
Lei Zhang
T. Nguyen
LLMSV
191
0
0
05 Oct 2025
Zoom-In to Sort AI-Generated Images Out
Yikun Ji
Y. Hong
Bowen Deng
Jun Lan
Huijia Zhu
Weiqiang Wang
Liqing Zhang
Jianfu Zhang
148
0
0
05 Oct 2025
One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
Lorenzo Bianchi
Giacomo Pacini
F. Carrara
Nicola Messina
Giuseppe Amato
Fabrizio Falchi
VLM
146
0
0
03 Oct 2025
MR²-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval
Junjie Zhou
Ze Liu
Lei Xiong
Jin-Ge Yao
Yueze Wang
...
Zhicheng Dou
Siqi Bao
Defu Lian
Yongping Xiong
Zheng Liu
VLM
LRM
118
0
0
30 Sep 2025
OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding
Jiancong Xie
Wenjin Wang
Zhuomeng Zhang
Zihan Liu
Qi Liu
Ke Feng
Zixun Sun
Yuedong Yang
VLM
69
0
0
29 Sep 2025
Bridging the behavior-neural gap: A multimodal AI reveals the brain's geometry of emotion more accurately than human self-reports
Changde Du
Yizhuo Lu
Zhongyu Huang
Yi Sun
Zisen Zhou
Shaozheng Qin
Huiguang He
65
0
0
29 Sep 2025
PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications
Hitesh Laxmichand Patel
Amit Agarwal
Srikant Panda
Hansa Meghwani
Karan Dua
Paul Li
Tao Sheng
Sujith Ravi
Dan Roth
96
2
0
28 Sep 2025
AutoPrune: Each Complexity Deserves a Pruning Policy
Hanshi Wang
Yuhao Xu
Zekun Xu
Jin Gao
Yufan Liu
Weiming Hu
Ke Wang
Zhipeng Zhang
VLM
128
0
0
28 Sep 2025
RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks
Amit Agarwal
Hitesh Laxmichand Patel
Srikant Panda
Hansa Meghwani
Jyotika Singh
Karan Dua
Paul Li
Tao Sheng
Sujith Ravi
Dan Roth
LRM
126
3
0
28 Sep 2025
Multilingual Vision-Language Models, A Survey
Andrei-Alexandru Manea
Jindřich Libovický
VLM
139
1
0
26 Sep 2025
Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
Ruoyu Chen
Xiaoqing Guo
Kangwei Liu
Siyuan Liang
Shiming Liu
Qunli Zhang
Hua Zhang
Xiaochun Cao
168
0
0
26 Sep 2025
Explaining multimodal LLMs via intra-modal token interactions
Jiawei Liang
Ruoyu Chen
Xianghao Jiao
Siyuan Liang
Shiming Liu
Qunli Zhang
Zheng Hu
Xiaochun Cao
LRM
145
0
0
26 Sep 2025
Memory Self-Regeneration: Uncovering Hidden Knowledge in Unlearned Models
Agnieszka Polowczyk
Alicja Polowczyk
Joanna Waczyńska
Piotr Borycki
Przemysław Spurek
152
0
0
26 Sep 2025
OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment
Teng Xiao
Zuchao Li
Lefei Zhang
165
0
0
23 Sep 2025
NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning
Sahil Shah
S P Sharan
Harsh Goel
Minkyu Choi
Mustafa Munir
Manvik Pasula
R. Marculescu
Sandeep Chinchali
NAI
112
1
0
22 Sep 2025
ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding
Jialiang Kang
Han Shu
Wenshuo Li
Yingjie Zhai
Xinghao Chen
MLLM
VLM
318
1
0
17 Sep 2025
Synthetic Captions for Open-Vocabulary Zero-Shot Segmentation
Tim Lebailly
Vijay Veerabadran
Satwik Kottur
Karl Ridgeway
Michael L. Iuzzolino
VLM
87
0
0
15 Sep 2025
Seeing is Not Understanding: A Benchmark on Perception-Cognition Disparities in Large Language Models
Haokun Li
Yazhou Zhang
Jizhi Ding
Qiuchi Li
Peng Zhang
99
0
0
14 Sep 2025
Towards Meta-Cognitive Knowledge Editing for Multimodal LLMs
Zhaoyu Fan
Kaihang Pan
Mingze Zhou
Bosheng Qin
Juncheng Billy Li
Shengyu Zhang
Wenqiao Zhang
Siliang Tang
Fei Wu
Yueting Zhuang
KELM
132
0
0
06 Sep 2025
STADI: Fine-Grained Step-Patch Diffusion Parallelism for Heterogeneous GPUs
Han Liang
Jiahui Zhou
Zicheng Zhou
Xiaoxi Zhang
Xu Chen
DiffM
155
1
0
05 Sep 2025
Aesthetic Image Captioning with Saliency Enhanced MLLMs
Yilin Tao
Jiashui Huang
Huaze Xu
Ling Shao
237
0
0
04 Sep 2025