Abstract
Model extraction attacks (MEAs) on large language models (LLMs) have received increasing attention in recent research. However, existing attack methods typically adapt extraction strategies originally developed for deep neural networks (DNNs), neglecting the underlying inconsistency between the training task of MEA and that of LLM alignment, which leads to suboptimal attack performance. To tackle this issue, we propose Locality Reinforced Distillation (LoRD), a novel model extraction algorithm designed specifically for LLMs. In particular, LoRD employs a newly defined policy-gradient-style training task that uses the victim model's responses as the signal guiding the crafting of preferences for the local model. Theoretical analyses demonstrate that (i) the convergence procedure of LoRD in model extraction is consistent with the alignment procedure of LLMs, and (ii) LoRD can reduce query complexity while mitigating watermark protection through exploration-based stealing. Extensive experiments validate the superiority of our method in extracting various state-of-the-art commercial LLMs. Our code is available at: https://github.com/liangzid/LoRD-MEA.
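The core mechanism the abstract describes, sampling from a local model and using the black-box victim's response as a preference signal in a policy-gradient-style update, can be illustrated with a toy sketch. Everything below (the candidate set, the ±1 reward, the tabular "model") is an illustrative assumption for exposition, not the paper's actual LoRD formulation; see the linked repository for the real algorithm.

```python
import math
import random

# Toy sketch (assumed setup, not the paper's method): the "victim" is a
# black box we can only query for responses; the "local model" is a
# softmax over per-query scores. A REINFORCE-style update raises the
# probability of responses that agree with the victim.

QUERIES = ["q1", "q2"]
CANDIDATES = ["yes", "no", "maybe"]

def victim(query):
    # Black-box victim: we observe only its chosen response, never logits.
    return {"q1": "yes", "q2": "no"}[query]

class LocalModel:
    def __init__(self):
        self.scores = {q: {c: 0.0 for c in CANDIDATES} for q in QUERIES}

    def probs(self, q):
        exps = {c: math.exp(s) for c, s in self.scores[q].items()}
        z = sum(exps.values())
        return {c: e / z for c, e in exps.items()}

    def sample(self, q, rng):
        p = self.probs(q)
        return rng.choices(list(p), weights=list(p.values()), k=1)[0]

def pg_step(model, q, rng, lr=0.5):
    # Exploration-based stealing, schematically: sample a response from
    # the local model, then reward agreement with the victim's response.
    a = model.sample(q, rng)
    reward = 1.0 if a == victim(q) else -1.0
    p = model.probs(q)
    # REINFORCE gradient of log pi(a|q) for a softmax policy:
    # +(1 - p[a]) for the sampled action, -p[c] for the others.
    for c in CANDIDATES:
        grad = (1.0 if c == a else 0.0) - p[c]
        model.scores[q][c] += lr * reward * grad

rng = random.Random(0)
model = LocalModel()
for _ in range(200):
    for q in QUERIES:
        pg_step(model, q, rng)

extracted = {q: max(model.probs(q), key=model.probs(q).get) for q in QUERIES}
print(extracted)
```

After training, the local model's most likely response matches the victim's on both queries, which is the sense in which the victim's responses "guide the crafting of preferences" for the local model in this simplified picture.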
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |
| Editors | Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 1441–1465 |
| Number of pages | 25 |
| ISBN (Electronic) | 9798891762510 |
| DOIs | |
| Publication status | Published - Jul 2025 |
| Event | 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025, Austria Center Vienna, Vienna, Austria; duration: 27 Jul 2025 – 1 Aug 2025. Conference website: https://2025.aclweb.org/ ; conference program: https://docs.google.com/spreadsheets/d/1O-n3HPvv8vY0L_kjyP5AtRTcWWjqLk2deCYtrMgCGw4/edit?usp=drive_link ; conference proceedings: https://aclanthology.org/events/acl-2025/ |
Publication series
| Name | Proceedings of the Annual Meeting of the Association for Computational Linguistics |
|---|---|
| ISSN (Print) | 0736-587X |
Conference
| Conference | 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025 |
|---|---|
| Country/Territory | Austria |
| City | Vienna |
| Period | 27/07/25 → 1/08/25 |
| Internet address | https://2025.aclweb.org/ |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs):
- SDG 16: Peace, Justice and Strong Institutions
Fingerprint
Dive into the research topics of '“Yes, My LoRD.” Guiding Language Model Extraction with Locality Reinforced Distillation'. Together they form a unique fingerprint.