Abstract
Large pre-trained vision-language models like CLIP have demonstrated significant potential for learning representations that can be applied across a variety of downstream tasks. Unlike traditional representation learning, which primarily relies on labeled data, vision-language pre-training aligns images and text within a shared feature space. This alignment allows for zero-shot transfer to downstream tasks through prompting, where classification weights are generated from natural language descriptions of the target classes. However, a major hurdle in deploying these models is prompt engineering, which is time-consuming and requires substantial domain expertise. Small changes in wording can significantly affect performance, making the process labor-intensive. In this chapter, we discuss a simple yet effective approach called Context Optimization (CoOp) for adapting CLIP-like vision-language models for image recognition tasks. CoOp uses learnable vectors to model the context words of prompts while keeping the pre-trained model parameters fixed. On 11 benchmark datasets, CoOp outperforms hand-crafted prompts with as few as one or two examples. Despite being a learning-based approach, CoOp also exhibits excellent domain generalization, surpassing zero-shot models that use hand-crafted prompts.
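The idea summarized in the abstract, learnable context vectors optimized against a frozen model, can be illustrated with a small toy sketch. Everything below is an assumption made for illustration and is not taken from the chapter: the names (`encode_prompt`, `W_TEXT`, `CLASS_EMB`, `train`), the tiny mean-pooling "text encoder" standing in for CLIP's text encoder, the synthetic image features, and the use of finite-difference gradients with step halving in place of backpropagation.

```python
import math
import random

random.seed(0)
D, M, K = 6, 4, 3  # toy sizes: embedding dim, context tokens, classes

def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

# Frozen stand-ins for the pre-trained model (never updated below).
W_TEXT = [rand_vec(D) for _ in range(D)]     # toy text-encoder weights
CLASS_EMB = [rand_vec(D) for _ in range(K)]  # class-name token embeddings
# Synthetic few-shot data: two pre-extracted image features per class.
DATA = [(rand_vec(D), k) for k in range(K) for _ in range(2)]

def encode_prompt(ctx, k):
    """Toy text encoder: mean-pool [ctx_1..ctx_M, class_k], project, tanh."""
    tokens = ctx + [CLASS_EMB[k]]
    pooled = [sum(t[d] for t in tokens) / len(tokens) for d in range(D)]
    return [math.tanh(sum(w * p for w, p in zip(row, pooled))) for row in W_TEXT]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)

def loss(ctx, temperature=0.1):
    """Mean cross-entropy of a softmax over cosine-similarity logits."""
    # One classification weight per class, generated from its prompt.
    weights = [encode_prompt(ctx, k) for k in range(K)]
    total = 0.0
    for feat, label in DATA:
        logits = [cosine(feat, w) / temperature for w in weights]
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[label]
    return total / len(DATA)

def numeric_grad(ctx, eps=1e-4):
    """Central finite differences with respect to the context vectors only."""
    grad = [[0.0] * D for _ in range(M)]
    for i in range(M):
        for d in range(D):
            old = ctx[i][d]
            ctx[i][d] = old + eps
            up = loss(ctx)
            ctx[i][d] = old - eps
            down = loss(ctx)
            ctx[i][d] = old
            grad[i][d] = (up - down) / (2 * eps)
    return grad

def train(steps=50, lr=0.5):
    # The context vectors, shared by all classes, are the only trainable values.
    ctx = [[random.gauss(0.0, 0.02) for _ in range(D)] for _ in range(M)]
    initial = current = loss(ctx)
    for _ in range(steps):
        g = numeric_grad(ctx)
        trial = [[c - lr * gc for c, gc in zip(row, grow)]
                 for row, grow in zip(ctx, g)]
        trial_loss = loss(trial)
        if trial_loss < current:
            ctx, current = trial, trial_loss  # accept the improving step
        else:
            lr *= 0.5                         # otherwise shrink the step size
    return ctx, initial, current

if __name__ == "__main__":
    _, before, after = train()
    print(f"loss before: {before:.4f}, after: {after:.4f}")
```

In this sketch only `ctx` changes during training, while `W_TEXT` and `CLASS_EMB` play the role of the fixed pre-trained parameters. A practical setup would instead backpropagate through the real text encoder with an automatic-differentiation framework.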
| Original language | English |
|---|---|
| Title of host publication | Large Vision-Language Models |
| Subtitle of host publication | Pre-training, Prompting, and Applications |
| Editors | Kaiyang Zhou, Ziwei Liu, Peng Gao |
| Place of Publication | Cham |
| Publisher | Springer Cham |
| Chapter | 5 |
| Pages | 115-133 |
| Number of pages | 19 |
| ISBN (Electronic) | 9783031949692 |
| ISBN (Print) | 9783031949685, 9783031949715 |
| Publication status | Published - 30 Aug 2025 |
Publication series
| Name | Advances in Computer Vision and Pattern Recognition |
|---|---|
| Volume | Part F886 |
| ISSN (Print) | 2191-6586 |
| ISSN (Electronic) | 2191-6594 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 9 Industry, Innovation, and Infrastructure
User-Defined Keywords
- Contrastive Learning
- Domain Generalization
- Few-Shot Learning
- Image Classification
- Prompt Engineering
- Prompt Learning
- Robustness
- Vision-Language Model
- Zero-Shot Learning
Fingerprint
Dive into the research topics of 'Differentiable Prompt Learning for Vision-Language Models'. Together they form a unique fingerprint.