TY - JOUR
T1 - AlignCLIP: navigating the misalignments for robust vision-language generalization
AU - Han, Zhongyi
AU - Luo, Gongxu
AU - Sun, Hao
AU - Li, Yaqian
AU - Han, Bo
AU - Gong, Mingming
AU - Zhang, Kun
AU - Liu, Tongliang
N1 - Tongliang Liu is partially supported by the following Australian Research Council projects: FT220100318, DP220102121, LP220100527, LP220200949, and IC190100031.
Publisher Copyright:
© The Author(s) 2025.
PY - 2025/3
Y1 - 2025/3
AB - In the realm of Vision-Language Pretraining models, achieving robust and adaptive representations is a cornerstone for successfully handling the unpredictability of real-world scenarios. This paper delves into two pivotal misalignment challenges inherent to Contrastive Language-Image Pre-training (CLIP) models: attention misalignment, which leads to an overemphasis on background elements rather than salient objects, and predictive category misalignment, characterized by the model’s struggle to discern between classes based on similarity. These misalignments undermine the representational stability essential for dynamic, real-world applications. To address these challenges, we propose AlignCLIP, an advanced fine-tuning methodology distinguished by its attention alignment loss, designed to calibrate the distribution of attention across multi-head attention layers. Furthermore, AlignCLIP introduces semantic label smoothing, a technique that leverages textual class similarities to refine prediction hierarchies. Through comprehensive experimentation on a variety of datasets and in scenarios involving distribution shifts and unseen classes, we demonstrate that AlignCLIP significantly enhances the stability of representations and shows superior generalization capabilities.
KW - Attention alignment
KW - Class representation stability
KW - Domain generalization
KW - Semantic label smoothing
KW - Vision-language pretraining
UR - http://www.scopus.com/inward/record.url?scp=85218209050&partnerID=8YFLogxK
U2 - 10.1007/s10994-025-06742-z
DO - 10.1007/s10994-025-06742-z
M3 - Journal article
AN - SCOPUS:85218209050
SN - 0885-6125
VL - 114
JO - Machine Learning
JF - Machine Learning
IS - 3
M1 - 58
ER -