AlignCLIP: navigating the misalignments for robust vision-language generalization

Zhongyi Han*, Gongxu Luo, Hao Sun, Yaqian Li, Bo Han, Mingming Gong, Kun Zhang, Tongliang Liu*

*Corresponding author for this work

Research output: Contribution to journal › Journal article › peer-review

Abstract

In the realm of Vision-Language Pretraining models, achieving robust and adaptive representations is a cornerstone for successfully handling the unpredictability of real-world scenarios. This paper examines two pivotal misalignment challenges inherent to Contrastive Language-Image Pre-training (CLIP) models: attention misalignment, in which the model overemphasizes background elements rather than salient objects, and predictive category misalignment, in which the model struggles to distinguish between semantically similar classes. These misalignments undermine the representational stability essential for dynamic, real-world applications. To address them, we propose AlignCLIP, an advanced fine-tuning methodology distinguished by an attention alignment loss that calibrates the distribution of attention across multi-head attention layers. AlignCLIP further introduces semantic label smoothing, a technique that leverages textual class similarities to refine prediction hierarchies. Through comprehensive experiments on a variety of datasets, including scenarios with distribution shifts and unseen classes, we demonstrate that AlignCLIP significantly enhances the stability of representations and exhibits superior generalization.
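
The abstract names the two mechanisms but not their formulations. As an illustration only, the sketch below renders one plausible reading of each in PyTorch; every function name, the softmax temperature `tau`, the smoothing weight `eps`, and the choice of a KL divergence toward the mean head are assumptions, not the paper's published method.

```python
# Hypothetical sketches of the two AlignCLIP components named in the abstract.
# Everything below is an illustrative assumption about how such losses could
# look; it is not the paper's implementation.
import torch
import torch.nn.functional as F


def semantic_label_smoothing(text_embeds: torch.Tensor,
                             targets: torch.Tensor,
                             eps: float = 0.1,
                             tau: float = 0.07) -> torch.Tensor:
    """Soft targets whose smoothing mass follows textual class similarity.

    text_embeds: (C, D) L2-normalized, precomputed class-name embeddings
    from the CLIP text encoder; targets: (B,) integer class labels.
    """
    num_classes = text_embeds.size(0)
    sim = text_embeds @ text_embeds.t()                       # (C, C) cosine similarity
    eye = torch.eye(num_classes, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(eye, float('-inf'))                 # no smoothing mass on the true class
    smooth = F.softmax(sim / tau, dim=-1)                     # similarity-weighted smoothing distribution
    one_hot = F.one_hot(targets, num_classes).float()         # (B, C)
    # Keep (1 - eps) on the true class; spread eps over semantically close classes.
    return (1.0 - eps) * one_hot + eps * smooth[targets]


def attention_alignment_loss(attn: torch.Tensor) -> torch.Tensor:
    """Pull each head's [CLS]->patch attention toward the cross-head mean.

    attn: (B, H, T, T) post-softmax attention weights from one
    multi-head attention layer of the vision transformer.
    """
    cls_attn = attn[:, :, 0, 1:]                              # (B, H, T-1): CLS attention over patches
    cls_attn = cls_attn / cls_attn.sum(dim=-1, keepdim=True)  # renormalize after dropping CLS->CLS
    mean_attn = cls_attn.mean(dim=1, keepdim=True)            # consensus map across heads
    log_mean = mean_attn.clamp_min(1e-8).log().expand_as(cls_attn)
    # KL(head || mean): heads fixating on background outliers are pulled in.
    return F.kl_div(log_mean, cls_attn, reduction='batchmean')
```

Since F.cross_entropy accepts probability-valued targets (PyTorch ≥ 1.10), the smoothed labels can replace hard labels directly in a standard CLIP fine-tuning objective, with the attention term added as a weighted auxiliary loss.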

Original language: English
Article number: 58
Number of pages: 19
Journal: Machine Learning
Volume: 114
Issue number: 3
Early online date: 6 Feb 2025
DOIs
Publication status: Published - Mar 2025

User-Defined Keywords

  • Attention alignment
  • Class representation stability
  • Domain generalization
  • Semantic label smoothing
  • Vision-language pretraining
