Supplementary Prompt Learning for Vision-Language Models

Rongfei Zeng, Zhipeng Yang, Ruiyun Yu, Yonggang Zhang*

*Corresponding author for this work

Research output: Contribution to journal › Journal article › peer-review

Abstract

Pre-trained vision-language models like CLIP have shown remarkable capabilities across various downstream tasks with well-tuned prompts. Advanced methods tune prompts by optimizing the context while keeping the class name fixed, implicitly assuming that the class names in prompts are accurate and not missing. However, this assumption may be violated in numerous real-world scenarios, leading to potential performance degradation or even failure of existing prompt learning methods. For example, the class name assigned to an image containing “Transformers” might be inaccurate because selecting a precise class name among numerous candidates is challenging. Moreover, assigning class names to some images may require specialized knowledge, resulting in index labels rather than semantic labels, e.g., Group 3 and Group 4 subtypes of medulloblastoma. To cope with the missing-class-name issue, we propose a simple yet effective prompt learning approach, called Supplementary Optimization (SOp), to supplement the missing class-related information. Specifically, SOp models the class names as learnable vectors while keeping the context fixed to learn prompts for downstream tasks. Extensive experiments across 18 public datasets demonstrate the efficacy of SOp when class names are missing. SOp can achieve performance comparable to that of the context optimization approach, even without using the prior information in the class names.
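The sketch below is not the authors' code; it is a minimal illustration, under assumptions, of the idea described in the abstract: keep the context token embeddings (e.g., for “a photo of a”) frozen and instead learn the class-name token vectors. The stand-in text encoder, dimensions, and hyperparameters are hypothetical placeholders for a frozen CLIP text encoder.

```python
# Minimal sketch (not the authors' implementation) of SOp-style prompt learning:
# class-name tokens are learnable vectors, the context embeddings stay fixed.
# The "text_encoder" below is a toy stand-in for a frozen CLIP text encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SOpPromptLearner(nn.Module):
    def __init__(self, num_classes, ctx_embeddings, name_len=4, dim=512):
        super().__init__()
        # Frozen context token embeddings, e.g. for "a photo of a".
        self.register_buffer("ctx", ctx_embeddings)              # (ctx_len, dim)
        # Learnable class-name vectors: name_len tokens per class.
        self.class_vecs = nn.Parameter(torch.randn(num_classes, name_len, dim) * 0.02)

    def forward(self):
        # Prepend the fixed context to each class's learnable name tokens.
        ctx = self.ctx.unsqueeze(0).expand(self.class_vecs.size(0), -1, -1)
        return torch.cat([ctx, self.class_vecs], dim=1)          # (C, ctx_len + name_len, dim)

# Toy frozen text encoder: pools token embeddings into a single text feature.
text_encoder = nn.Sequential(nn.Linear(512, 512), nn.Tanh()).eval()
for p in text_encoder.parameters():
    p.requires_grad_(False)

ctx_emb = torch.randn(4, 512)                     # embeddings of the fixed context tokens
learner = SOpPromptLearner(num_classes=10, ctx_embeddings=ctx_emb)
optimizer = torch.optim.SGD(learner.parameters(), lr=2e-3)

# One illustrative training step with random image features and labels.
image_feats = F.normalize(torch.randn(32, 512), dim=-1)
labels = torch.randint(0, 10, (32,))
prompts = learner()                               # (10, 8, 512)
text_feats = F.normalize(text_encoder(prompts).mean(dim=1), dim=-1)
logits = 100.0 * image_feats @ text_feats.t()     # CLIP-style cosine-similarity logits
loss = F.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```

In this sketch only `class_vecs` receives gradients, mirroring the contrast drawn in the abstract with context-optimization methods, which learn the context and keep the class-name embeddings fixed.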

Original language: English
Number of pages: 18
Journal: International Journal of Computer Vision
Publication status: E-pub ahead of print - 24 May 2025

User-Defined Keywords

  • Multi-modal
  • Prompt learning
  • Vision-language models
