Noisy Test-Time Adaptation in Vision-Language Models

Chentao Cao, Zhun Zhong*, Zhanke Zhou, Tongliang Liu, Yang Liu, Kun Zhang, Bo Han*

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding › Conference proceeding › peer-review

Abstract

Test-time adaptation (TTA) aims to address distribution shifts between source and target data by relying solely on the target data at test time. In open-world scenarios, models often encounter noisy samples, i.e., samples outside the in-distribution (ID) label space. Leveraging the zero-shot capability of pre-trained vision-language models (VLMs), this paper introduces Zero-Shot Noisy TTA (ZS-NTTA), which focuses on adapting the model to target data containing noisy samples at test time in a zero-shot manner. In a preliminary study, we reveal that existing TTA methods suffer a severe performance decline under ZS-NTTA, often lagging behind even the frozen model. We conduct comprehensive experiments to analyze this phenomenon and find that the negative impact of unfiltered noisy data outweighs the benefit of clean data during model updating. Moreover, because these methods rely on the adapting classifier to perform both the ID classification and noise detection sub-tasks, the model's ability on both sub-tasks is largely hampered. Based on this analysis, we propose a novel framework that decouples the classifier and the detector, focusing on developing an individual detector while keeping the classifier (including the backbone) frozen. Technically, we introduce the Adaptive Noise Detector (AdaND), which uses the frozen model's outputs as pseudo-labels to train a noise detector that identifies noisy samples effectively. To handle clean data streams, we further inject Gaussian noise during adaptation, preventing the detector from misclassifying clean samples as noisy. Beyond ZS-NTTA, AdaND also improves the zero-shot out-of-distribution (ZS-OOD) detection ability of VLMs. Extensive experiments show that our method outperforms in both ZS-NTTA and ZS-OOD detection. On ImageNet, AdaND achieves a notable improvement of 8.32% in harmonic mean accuracy (AccH) for ZS-NTTA and 9.40% in FPR95 for ZS-OOD detection, compared to state-of-the-art methods.
Importantly, AdaND is computationally efficient and comparable to the model-frozen method. The code is publicly available at: https://github.com/tmlr-group/ZS-NTTA.
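The abstract's core recipe — keep the classifier frozen, pseudo-label the test stream with the frozen model's own confidence, inject Gaussian noise as guaranteed "noisy" supervision, and train only a lightweight detector — can be illustrated with a toy simulation. This is a minimal sketch under invented assumptions (random linear "frozen model", synthetic features, a median-confidence pseudo-labeling rule, logistic-regression detector); it is not the authors' implementation of AdaND.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy stand-in for a frozen VLM: a fixed random head maps features to class logits.
D, C = 16, 5
frozen_head = rng.normal(size=(D, C))  # stays frozen throughout

def frozen_logits(feats):
    return feats @ frozen_head

# Simulated test stream: "clean" (ID) features cluster along class directions,
# "noisy" (outside the ID label space) features are unstructured.
clean = 2.0 * frozen_head[:, rng.integers(0, C, 200)].T + rng.normal(size=(200, D))
noisy = rng.normal(size=(200, D))
stream = np.vstack([clean, noisy])
is_noisy_true = np.array([0] * 200 + [1] * 200)

# Pseudo-labels from the frozen model: high max-softmax confidence -> clean,
# low -> noisy. The median threshold is an illustrative choice, not the paper's.
logits = frozen_logits(stream)
conf = softmax(logits).max(axis=1)
pseudo_noisy = (conf < np.median(conf)).astype(float)

# Injected Gaussian-noise inputs provide "noisy" supervision even if the
# actual stream happened to be entirely clean.
gauss = rng.normal(size=(100, D))
Z = np.vstack([logits, frozen_logits(gauss)])       # detector sees model outputs
y = np.concatenate([pseudo_noisy, np.ones(100)])    # injected noise labeled noisy

# Lightweight logistic-regression detector; only (w, b) are updated.
w, b = np.zeros(C), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
    g = p - y
    w -= 0.05 * Z.T @ g / len(y)
    b -= 0.05 * g.mean()

pred_noisy = (1.0 / (1.0 + np.exp(-(logits @ w + b))) > 0.5).astype(int)
acc = (pred_noisy == is_noisy_true).mean()
print(f"toy detector accuracy on the stream: {acc:.2f}")
```

In this toy setup the detector separates clean from noisy samples well, while the classifier and backbone are never updated — mirroring why the decoupled design sidesteps the degradation that unfiltered noisy data causes when the classifier itself is adapted.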

Original language: English
Title of host publication: Proceedings of the Thirteenth International Conference on Learning Representations, ICLR 2025
Publisher: International Conference on Learning Representations, ICLR
Pages: 47183-47214
Number of pages: 32
ISBN (Electronic): 9798331320850
Publication status: Published - 24 Apr 2025
Event: 13th International Conference on Learning Representations, ICLR 2025 - Singapore
Duration: 24 Apr 2025 – 28 Apr 2025
https://iclr.cc/Conferences/2025 (Conference website)
https://openreview.net/group?id=ICLR.cc/2025/Conference#tab-accept-oral (Conference proceedings)

Publication series

Name: International Conference on Learning Representations, ICLR

Conference

Conference: 13th International Conference on Learning Representations, ICLR 2025
Country/Territory: Singapore
Period: 24/04/25 – 28/04/25
