Abstract
Contrastive Vision-Language Models (VLMs) have emerged as powerful tools, excelling in various open-vocabulary tasks such as image recognition, retrieval-augmented task adaptation, and visual chatbots. To better adapt to downstream tasks, various parameter-efficient fine-tuning approaches have been developed by the community, e.g., prompt learning. However, an important issue has received little attention: the confidence calibration problem in zero-shot or fine-tuned VLMs, which can significantly undermine the reliability of these models in downstream applications. This chapter addresses this issue by systematically studying the confidence calibration problem in the context of prompt learning for CLIP. The analysis reveals that existing calibration techniques are inadequate, particularly in open-vocabulary scenarios. This chapter then discusses a simple yet effective approach called Distance-Aware Calibration (DAC), which automatically adjusts the temperature scaling parameter based on the distance between predicted text labels and base classes. The effectiveness of the approach is validated on 7 prompt learning methods across 11 downstream tasks.
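The idea of adjusting the temperature according to how far a predicted text label lies from the base classes can be illustrated with a minimal sketch. The abstract does not give DAC's actual formula, so everything below is an assumption made for illustration: the function names, the use of cosine distance to the nearest base-class embedding, and the linear rule `base_temperature * (1 + scale * distance)` are hypothetical and are not taken from the chapter.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a larger temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_aware_temperature(label_emb, base_embs, base_temperature=1.0, scale=1.0):
    """Hypothetical rule (an assumption, not the chapter's formula):
    the temperature grows linearly with the distance from the predicted
    label's text embedding to its nearest base-class embedding."""
    nearest = min(cosine_distance(label_emb, b) for b in base_embs)
    return base_temperature * (1.0 + scale * nearest)


# Toy 2-D text embeddings, made up for illustration only.
base_class_embs = [[1.0, 0.0], [0.8, 0.6]]
seen_label = [1.0, 0.0]    # coincides with a base class
novel_label = [0.0, 1.0]   # far from the first base class

logits = [4.0, 1.0, 0.5]
t_seen = distance_aware_temperature(seen_label, base_class_embs)
t_novel = distance_aware_temperature(novel_label, base_class_embs)
probs_seen = softmax(logits, t_seen)
probs_novel = softmax(logits, t_novel)
```

Under this assumed rule, a prediction whose label is far from every base class receives a higher temperature, so its top-class confidence is lower than it would be for a label that matches a base class.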
| Original language | English |
|---|---|
| Title of host publication | Large Vision-Language Models |
| Subtitle of host publication | Pre-training, Prompting, and Applications |
| Editors | Kaiyang Zhou, Ziwei Liu, Peng Gao |
| Place of Publication | Cham |
| Publisher | Springer Cham |
| Chapter | 9 |
| Pages | 207-226 |
| Number of pages | 20 |
| ISBN (Electronic) | 9783031949692 |
| ISBN (Print) | 9783031949685, 9783031949715 |
| Publication status | Published - 30 Aug 2025 |
Publication series
| Name | Advances in Computer Vision and Pattern Recognition |
|---|---|
| Volume | Part F886 |
| ISSN (Print) | 2191-6586 |
| ISSN (Electronic) | 2191-6594 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 9 Industry, Innovation, and Infrastructure
User-Defined Keywords
- Confidence calibration
- Prompt learning
- Vision-language model
Fingerprint
Dive into the research topics of 'Confidence Calibration in Contrastive Vision-Language Models'. Together they form a unique fingerprint.