Detecting Machine-Generated Texts by Multi-Population Aware Optimization for Maximum Mean Discrepancy

Shuhai Zhang, Yiliao Song, Jiahao Yang, Yuanqing Li*, Bo Han*, Mingkui Tan*

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding › Conference proceeding › peer-review

1 Citation (Scopus)

Abstract

Large language models (LLMs) such as ChatGPT have exhibited remarkable performance in generating human-like texts. However, machine-generated texts (MGTs) may carry critical risks, such as plagiarism and hallucinated information. Detecting MGTs is therefore urgent and important in many situations. Unfortunately, it is challenging to distinguish MGTs from human-written texts because the distributional discrepancy between them is often very subtle, owing to the remarkable performance of LLMs. In this paper, we seek to exploit maximum mean discrepancy (MMD) to address this issue, since MMD can effectively identify distributional discrepancies. However, directly training a detector with MMD on diverse MGTs significantly increases the variance of MMD, because MGTs produced by different LLMs may contain multiple text populations. This severely impairs MMD's ability to measure the difference between two samples. To tackle this, we propose a novel multi-population aware optimization method for MMD called MMD-MP, which avoids variance increases and thus improves stability when measuring the distributional discrepancy. Relying on MMD-MP, we develop two methods for paragraph-based and sentence-based detection, respectively. Extensive experiments on various LLMs, e.g., GPT2 and ChatGPT, show the superior detection performance of our MMD-MP.
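For readers unfamiliar with MMD, the quantity the abstract builds on can be illustrated with a minimal sketch. The code below computes the standard unbiased estimate of squared MMD with a fixed Gaussian kernel. It is not the authors' MMD-MP method, which optimizes a kernel with a multi-population aware objective. The function names, the bandwidth value, and the synthetic Gaussian "feature" arrays standing in for text embeddings are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    # Pairwise squared Euclidean distances, clipped at 0 to absorb rounding error.
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples x and y."""
    m, n = len(x), len(y)
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    # Within-sample averages exclude the diagonal (each point paired with itself).
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()


# Hypothetical stand-ins for text features: synthetic 8-dimensional vectors.
rng = np.random.default_rng(0)
human = rng.normal(0.0, 1.0, size=(200, 8))
human_again = rng.normal(0.0, 1.0, size=(200, 8))  # same distribution as `human`
machine = rng.normal(0.5, 1.0, size=(200, 8))      # shifted distribution

mmd_same = mmd2_unbiased(human, human_again, bandwidth=2.0)
mmd_diff = mmd2_unbiased(human, machine, bandwidth=2.0)
print(f"same distribution:    {mmd_same:.4f}")
print(f"shifted distribution: {mmd_diff:.4f}")
```

In this toy setting the estimate stays near zero when both samples come from the same distribution and grows when one sample is shifted. A detector of the kind the abstract describes would compute such a statistic on learned text features rather than on synthetic vectors.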

Original language: English
Title of host publication: Proceedings of the Twelfth International Conference on Learning Representations, ICLR 2024
Publisher: International Conference on Learning Representations
Pages: 1-36
Number of pages: 18
Publication status: Published - May 2024
Event: 12th International Conference on Learning Representations, ICLR 2024 - Messe Wien Exhibition and Congress Center, Vienna, Austria
Duration: 7 May 2024 to 11 May 2024
https://iclr.cc/Conferences/2024 (Conference website)
https://iclr.cc/virtual/2024/calendar (Conference schedule)
https://openreview.net/group?id=ICLR.cc/2024/Conference#tab-accept-oral (Conference proceedings)

Publication series

Name: Proceedings of the International Conference on Learning Representations, ICLR

Conference

Conference: 12th International Conference on Learning Representations, ICLR 2024
Country/Territory: Austria
City: Vienna
Period: 7/05/24 to 11/05/24

Scopus Subject Areas

  • Language and Linguistics
  • Computer Science Applications
  • Education
  • Linguistics and Language
