TY - JOUR
T1 - Supplementary Prompt Learning for Vision-Language Models
AU - Zeng, Rongfei
AU - Yang, Zhipeng
AU - Yu, Ruiyun
AU - Zhang, Yonggang
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
PY - 2025/5/24
Y1 - 2025/5/24
N2 - Pre-trained vision-language models like CLIP have shown remarkable capabilities across various downstream tasks with well-tuned prompts. Advanced methods tune prompts by optimizing the context while keeping the class name fixed, implicitly assuming that the class names in prompts are accurate and not missing. However, this assumption may be violated in numerous real-world scenarios, leading to potential performance degradation or even failure of existing prompt learning methods. For example, the class name assigned to an image containing “Transformers” might be inaccurate, because selecting a precise class name among numerous candidates is challenging. Moreover, assigning class names to some images may require specialized knowledge, resulting in index labels rather than semantic labels, e.g., Group 3 and Group 4 subtypes of medulloblastoma. To cope with the missing-class-name issue, we propose a simple yet effective prompt learning approach, called Supplementary Optimization (SOp), to supplement the missing class-related information. Specifically, SOp models the class names as learnable vectors while keeping the context fixed to learn prompts for downstream tasks. Extensive experiments across 18 public datasets demonstrate the efficacy of SOp when class names are missing. SOp can achieve performance comparable to that of the context optimization approach, even without using the prior information in the class names.
AB - Pre-trained vision-language models like CLIP have shown remarkable capabilities across various downstream tasks with well-tuned prompts. Advanced methods tune prompts by optimizing the context while keeping the class name fixed, implicitly assuming that the class names in prompts are accurate and not missing. However, this assumption may be violated in numerous real-world scenarios, leading to potential performance degradation or even failure of existing prompt learning methods. For example, the class name assigned to an image containing “Transformers” might be inaccurate, because selecting a precise class name among numerous candidates is challenging. Moreover, assigning class names to some images may require specialized knowledge, resulting in index labels rather than semantic labels, e.g., Group 3 and Group 4 subtypes of medulloblastoma. To cope with the missing-class-name issue, we propose a simple yet effective prompt learning approach, called Supplementary Optimization (SOp), to supplement the missing class-related information. Specifically, SOp models the class names as learnable vectors while keeping the context fixed to learn prompts for downstream tasks. Extensive experiments across 18 public datasets demonstrate the efficacy of SOp when class names are missing. SOp can achieve performance comparable to that of the context optimization approach, even without using the prior information in the class names.
KW - Multi-modal
KW - Prompt learning
KW - Vision-language models
UR - http://www.scopus.com/inward/record.url?scp=105006408063&partnerID=8YFLogxK
U2 - 10.1007/s11263-025-02451-1
DO - 10.1007/s11263-025-02451-1
M3 - Journal article
AN - SCOPUS:105006408063
SN - 0920-5691
JO - International Journal of Computer Vision
JF - International Journal of Computer Vision
ER -