Towards Effective Evaluations and Comparisons for LLM Unlearning Methods

Qizhou Wang, Bo Han*, Puning Yang, Jianing Zhu, Tongliang Liu, Masashi Sugiyama

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding › Conference proceeding › peer-review


Abstract

The imperative to eliminate undesirable data memorization underscores the significance of machine unlearning for large language models (LLMs). Recent research has introduced a series of promising unlearning methods, notably boosting the practical significance of the field. Nevertheless, adopting a proper evaluation framework that reflects the true unlearning efficacy is equally essential yet has not received adequate attention. This paper seeks to refine the evaluation of LLM unlearning by addressing two key challenges: (a) the robustness of evaluation metrics and (b) the trade-offs between competing goals. The first challenge stems from findings that current metrics are susceptible to various red teaming scenarios, suggesting that they may not reflect the true extent of knowledge retained by LLMs but rather mirror superficial model behaviors, making them prone to attacks. We address this issue by devising and assessing a series of candidate metrics, selecting the most robust ones under various types of attacks. The second challenge arises from the conflicting goals of eliminating unwanted knowledge while retaining knowledge that is not targeted for removal. This trade-off between unlearning and retention often fails to conform to a Pareto frontier, making it difficult to compare methods that excel only in either unlearning or retention. We handle this issue by proposing a calibration method that restores the original performance on non-targeted data after unlearning, thereby allowing us to focus exclusively on assessing the strength of unlearning. Our evaluation framework notably improves the assessment and comparison of various LLM unlearning methods, further allowing us to benchmark existing works, identify their proper hyper-parameters, and explore new tricks to enhance their practical efficacy. The code is publicly available at: https://github.com/tmlr-group/Unlearning-with-Control.
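
To make the comparison problem concrete, below is a minimal Python sketch (not the authors' implementation; all scores and names are illustrative assumptions) of why raw unlearning/retention score pairs can be Pareto-incomparable, and how calibrating retention back to a common level lets unlearning strength be compared directly:

from dataclasses import dataclass

@dataclass
class EvalResult:
    unlearning: float   # higher = targeted knowledge more thoroughly removed
    retention: float    # higher = non-targeted utility better preserved

def pareto_dominates(a: EvalResult, b: EvalResult) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    return (a.unlearning >= b.unlearning and a.retention >= b.retention
            and (a.unlearning > b.unlearning or a.retention > b.retention))

# Two hypothetical methods that are Pareto-incomparable: one unlearns
# harder, the other retains better. Raw scores alone cannot rank them.
method_a = EvalResult(unlearning=0.92, retention=0.61)
method_b = EvalResult(unlearning=0.74, retention=0.88)
assert not pareto_dominates(method_a, method_b)
assert not pareto_dominates(method_b, method_a)

# After a calibration step that restores both models' performance on
# non-targeted data to the original level (illustrative numbers; the
# abstract's calibration method is the motivating idea here), the
# remaining gap is attributable to unlearning strength alone.
calibrated_a = EvalResult(unlearning=0.85, retention=0.90)
calibrated_b = EvalResult(unlearning=0.74, retention=0.90)
assert pareto_dominates(calibrated_a, calibrated_b)  # now directly comparable

Once retention is pinned to a common level, ranking methods reduces to a one-dimensional comparison of unlearning scores, which is the property the paper's calibration method is designed to exploit.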

Original language: English
Title of host publication: Proceedings of the Thirteenth International Conference on Learning Representations, ICLR 2025
Publisher: International Conference on Learning Representations, ICLR
Pages: 93642-93670
Number of pages: 29
ISBN (Electronic): 9798331320850
Publication status: Published - 24 Apr 2025
Event: 13th International Conference on Learning Representations, ICLR 2025 - Singapore
Duration: 24 Apr 2025 - 28 Apr 2025
https://iclr.cc/Conferences/2025 (Conference website)
https://openreview.net/group?id=ICLR.cc/2025/Conference#tab-accept-oral (Conference proceedings)

Publication series

Name: International Conference on Learning Representations, ICLR

Conference

Conference: 13th International Conference on Learning Representations, ICLR 2025
Country/Territory: Singapore
Period: 24/04/25 - 28/04/25
