Towards Calibrating Prompt Tuning of Vision-Language Models

Ashshak Sharifdeen
Fahad Shamshad
Muhammad Akhtar Munir
Abhishek Basu
Mohamed Insaf Ismithdeen
Jeyapriyan Jeyamohan
Chathurika Sewwandi Silva
Karthik Nandakumar
Muhammad Haris Khan
Main: 8 pages · Bibliography: 2 pages · Appendix: 6 pages · 9 figures · 18 tables
Abstract

Prompt tuning of large-scale vision-language models such as CLIP enables efficient task adaptation without updating model weights. However, it often leads to poor confidence calibration and unreliable predictive uncertainty. We address this problem by proposing a calibration framework that enhances predictive reliability while preserving the geometry of the pretrained CLIP embedding space, which is required for robust generalization. Our approach extends the standard cross-entropy loss with two complementary regularizers: (1) a mean-variance margin penalty that stabilizes inter-class logit margins by maximizing their average while minimizing their dispersion, mitigating both underconfidence and overconfidence spikes; and (2) a text moment-matching loss that aligns the first and second moments of tuned text embeddings with their frozen CLIP counterparts, preserving the semantic dispersion crucial for generalization. Through extensive experiments across 7 prompt-tuning methods and 11 diverse datasets, we demonstrate that our approach significantly reduces the Expected Calibration Error (ECE) compared to competitive calibration techniques on both base and novel classes.
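The two regularizers described in the abstract can be sketched in NumPy. This is an illustrative reading of the abstract only, not the paper's implementation: the function names, the use of per-dimension variance as the "second moment," and the unweighted sum of terms are assumptions; the paper may define the margins, moments, and loss weights differently.

```python
import numpy as np

def margin_penalty(logits, labels):
    """Mean-variance margin penalty (illustrative sketch).

    For each sample, take the margins between the true-class logit and
    every other class logit, then encourage a large average margin
    (mitigating underconfidence) with low dispersion (mitigating
    overconfidence spikes).
    """
    n, c = logits.shape
    true_logit = logits[np.arange(n), labels][:, None]      # (n, 1)
    mask = np.ones((n, c), dtype=bool)
    mask[np.arange(n), labels] = False
    margins = (true_logit - logits)[mask]                   # inter-class margins
    # Maximize mean margin (negative sign) while minimizing its variance.
    return -margins.mean() + margins.var()

def text_moment_matching(tuned_emb, frozen_emb):
    """Text moment-matching loss (illustrative sketch).

    Aligns the first moment (per-dimension mean) and second moment
    (per-dimension variance, as one possible reading) of the tuned text
    embeddings with those of the frozen CLIP text embeddings.
    """
    mean_term = np.sum((tuned_emb.mean(axis=0) - frozen_emb.mean(axis=0)) ** 2)
    var_term = np.sum((tuned_emb.var(axis=0) - frozen_emb.var(axis=0)) ** 2)
    return mean_term + var_term
```

In training, these terms would be added to the cross-entropy loss with some weighting; the moment-matching term is zero when the tuned embeddings' statistics exactly match the frozen CLIP statistics, which is the geometry-preservation behavior the abstract describes.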
