
Differentiable Prompt Learning for Vision-Language Models

  • Kaiyang Zhou*
  • Jingkang Yang
  • Chen Change Loy
  • Ziwei Liu

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding, Chapter, peer-reviewed

Abstract

Large pre-trained vision-language models like CLIP have demonstrated significant potential for learning representations that can be applied across a variety of downstream tasks. Unlike traditional representation learning, which primarily relies on labeled data, vision-language pre-training aligns images and text within a shared feature space. This alignment allows for zero-shot transfer to downstream tasks through prompting, where classification weights are generated from natural language descriptions of the target classes. However, a major hurdle in deploying these models is prompt engineering, which is time-consuming and requires substantial domain expertise. Small changes in wording can significantly affect performance, making the process labor-intensive. In this chapter, we discuss a simple yet effective approach called Context Optimization (CoOp) for adapting CLIP-like vision-language models for image recognition tasks. CoOp uses learnable vectors to model the context words of prompts while keeping the pre-trained model parameters fixed. On 11 benchmark datasets, CoOp outperforms hand-crafted prompts with as few as one or two examples. Despite being a learning-based approach, CoOp also exhibits excellent domain generalization, surpassing zero-shot models that use hand-crafted prompts.
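The core idea in the abstract (learnable context vectors optimized while the pre-trained model stays fixed) can be illustrated with a minimal toy sketch. Everything below is an assumption made for illustration: the names `encode_prompt`, `class_probs`, and `train`, the sizes, and the random data are hypothetical, the "frozen encoder" is a small fixed function rather than CLIP, and gradients are estimated with finite differences purely to keep the example dependency-free, whereas a real setup would presumably backpropagate through the frozen text encoder.

```python
import math
import random

rng = random.Random(0)
DIM, N_CTX, N_CLASSES = 4, 2, 3  # toy sizes, not values from the chapter


def rand_vec():
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


# Frozen parts (never updated): encoder weights and class-name embeddings.
W = [rand_vec() for _ in range(DIM)]
CLASS_EMB = [rand_vec() for _ in range(N_CLASSES)]
# Few-shot training set: one (image_feature, label) pair per class.
TRAIN = [(rand_vec(), label) for label in range(N_CLASSES)]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def encode_prompt(ctx, class_emb):
    """Toy frozen text encoder: mean-pool [ctx..., class token], then tanh(W @ pooled)."""
    tokens = ctx + [class_emb]
    pooled = [sum(t[i] for t in tokens) / len(tokens) for i in range(DIM)]
    return normalize([math.tanh(sum(w * p for w, p in zip(row, pooled))) for row in W])


def class_probs(ctx, image_feat, temperature=0.1):
    """Softmax over cosine similarities between the image and each class prompt."""
    f = normalize(image_feat)
    logits = [sum(a * b for a, b in zip(f, encode_prompt(ctx, e))) / temperature
              for e in CLASS_EMB]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss(ctx):
    """Mean cross-entropy over the few-shot training set."""
    return -sum(math.log(class_probs(ctx, x)[y]) for x, y in TRAIN) / len(TRAIN)


def numeric_grad(ctx, eps=1e-5):
    """Central finite differences with respect to the context vectors only."""
    grad = []
    for i in range(N_CTX):
        row = []
        for j in range(DIM):
            plus = [v[:] for v in ctx]
            minus = [v[:] for v in ctx]
            plus[i][j] += eps
            minus[i][j] -= eps
            row.append((loss(plus) - loss(minus)) / (2 * eps))
        grad.append(row)
    return grad


def train(ctx, steps=50, lr=0.5):
    """Gradient descent on the context; a step is kept only if it lowers the loss."""
    current = loss(ctx)
    for _ in range(steps):
        grad = numeric_grad(ctx)
        candidate = [[c - lr * g for c, g in zip(cv, gv)] for cv, gv in zip(ctx, grad)]
        candidate_loss = loss(candidate)
        if candidate_loss < current:
            ctx, current = candidate, candidate_loss
        else:
            lr *= 0.5  # step too large: shrink it and try again
    return ctx, current


ctx_init = [rand_vec() for _ in range(N_CTX)]  # the only trainable parameters
initial_loss = loss(ctx_init)
ctx_learned, final_loss = train(ctx_init)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

In this sketch the context vectors are shared by all classes and only they change during training; `W` and `CLASS_EMB` play the role of the fixed pre-trained parameters. Replacing `ctx_init` with the embeddings of a hand-written phrase would correspond to the hand-crafted prompt baseline that the abstract compares against.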

Original language: English
Title of host publication: Large Vision-Language Models
Subtitle of host publication: Pre-training, Prompting, and Applications
Editors: Kaiyang Zhou, Ziwei Liu, Peng Gao
Place of Publication: Cham
Publisher: Springer Cham
Chapter: 5
Pages: 115-133
Number of pages: 19
ISBN (Electronic): 9783031949692
ISBN (Print): 9783031949685, 9783031949715
Publication status: Published, 30 Aug 2025

Publication series

Name: Advances in Computer Vision and Pattern Recognition
Volume: Part F886
ISSN (Print): 2191-6586
ISSN (Electronic): 2191-6594

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 9 - Industry, Innovation, and Infrastructure

User-Defined Keywords

  • Contrastive Learning
  • Domain Generalization
  • Few-Shot Learning
  • Image Classification
  • Prompt Engineering
  • Prompt Learning
  • Robustness
  • Vision-Language Model
  • Zero-Shot Learning
