Consistent prompt learning for vision-language models

Yonggang Zhang, Xinmei Tian*

*Corresponding author for this work

Research output: Contribution to journal, journal article, peer-reviewed

Abstract

Pre-trained vision-language models, such as CLIP, have shown remarkable capabilities across various downstream tasks by learning prompts that consist of a context concatenated with a class name; for example, ‘a photo of a [dog]’, with [dog] as the class prior. Advanced prompt-learning methods typically initialize and optimize the context (for example, ‘a photo of a’) for downstream task adaptation. However, context optimization typically leads to poor generalization performance on novel classes or on datasets sampled from different distributions. This may be attributed to prompt inconsistency; namely, prompts optimized using one image distribution may differ from those optimized using a different image distribution. To improve the generalization performance of optimized prompts, we propose the novel consistent prompt learning (CPL) approach, which performs distributional exploration to identify the image distribution that causes prompt inconsistency and then addresses it. CPL identifies and mitigates prompt inconsistency in an adversarial training scheme. Specifically, CPL calculates two similarities, one between a query image and each of two different prompts, and measures prompt inconsistency as the discrepancy between these two similarities. Subsequently, CPL performs distributional exploration to enlarge the discrepancy and uses an adversarial-training approach to mitigate it. Consequently, the model predictions are insensitive to prompt changes, and the optimized prompt performs well under various image distributions. Comprehensive experiments show that the proposed CPL method performs favorably on four types of representative tasks across 11 datasets, improving on existing prompt-learning methods and achieving state-of-the-art performance.
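
The enlarge-then-mitigate loop described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the abstract does not specify the discrepancy measure, the exploration procedure, or the optimizer, so everything below is an assumption made for illustration. The sketch uses a squared L2 distance between softmax-normalized cosine similarities as the discrepancy, random search within a fixed radius as a stand-in for distributional exploration, and a finite-difference gradient step with backtracking as a stand-in for the adversarial-training update. All function names, the feature dimensions, and the temperature value are hypothetical.

```python
import numpy as np

def cosine_logits(image_feat, prompt_feats, temperature=0.07):
    """Scaled cosine similarity between one image feature and each class prompt."""
    img = image_feat / np.linalg.norm(image_feat)
    txt = prompt_feats / np.linalg.norm(prompt_feats, axis=1, keepdims=True)
    return (txt @ img) / temperature

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def inconsistency(image_feat, prompts_a, prompts_b):
    """Assumed discrepancy: squared L2 distance between the two similarity distributions."""
    pa = softmax(cosine_logits(image_feat, prompts_a))
    pb = softmax(cosine_logits(image_feat, prompts_b))
    return float(np.sum((pa - pb) ** 2))

def explore(image_feat, prompts_a, prompts_b, radius=0.5, trials=64, rng=None):
    """Stand-in for distributional exploration: random search for a nearby
    image feature on which the two prompts disagree the most."""
    rng = np.random.default_rng(0) if rng is None else rng
    best, best_val = image_feat, inconsistency(image_feat, prompts_a, prompts_b)
    for _ in range(trials):
        delta = rng.normal(size=image_feat.shape)
        cand = image_feat + radius * delta / np.linalg.norm(delta)
        val = inconsistency(cand, prompts_a, prompts_b)
        if val > best_val:
            best, best_val = cand, val
    return best, best_val

def numeric_grad(f, params, eps=1e-5):
    """Central finite-difference gradient; adequate only for tiny toy arrays."""
    params = params.copy()
    grad = np.zeros_like(params)
    for idx in np.ndindex(params.shape):
        old = params[idx]
        params[idx] = old + eps
        hi = f(params)
        params[idx] = old - eps
        lo = f(params)
        params[idx] = old
        grad[idx] = (hi - lo) / (2 * eps)
    return grad

def mitigate(image_feat, prompts_a, prompts_b, lr=0.1, max_halvings=10):
    """Stand-in for the mitigation step: move prompts_b to shrink the
    discrepancy on the explored feature, halving the step until it does not grow."""
    loss = lambda p: inconsistency(image_feat, prompts_a, p)
    start = loss(prompts_b)
    grad = numeric_grad(loss, prompts_b)
    step = lr
    for _ in range(max_halvings):
        cand = prompts_b - step * grad
        if loss(cand) <= start:
            return cand
        step *= 0.5
    return prompts_b  # no acceptable step found; leave the prompts unchanged

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.normal(size=8)                        # toy image feature
    prompts_a = rng.normal(size=(3, 8))               # toy prompt features, 3 classes
    prompts_b = prompts_a + 0.1 * rng.normal(size=(3, 8))
    hard_image, gap = explore(image, prompts_a, prompts_b)
    prompts_b = mitigate(hard_image, prompts_a, prompts_b)
    print(gap, inconsistency(hard_image, prompts_a, prompts_b))
```

In a real setting the features would come from the model's image and text encoders and the update would use automatic differentiation rather than finite differences; the sketch only shows the order of operations: measure the discrepancy, search for inputs that enlarge it, then update the prompt to reduce it.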

Original language: English
Article number: 112974
Number of pages: 9
Journal: Knowledge-Based Systems
Volume: 310
Publication status: Published - 15 Feb 2025

Scopus Subject Areas

  • Software
  • Management Information Systems
  • Information Systems and Management
  • Artificial Intelligence

User-Defined Keywords

  • Domain adaptation
  • Domain generalization
  • Prompt learning
  • Vision-language models
