On the Generalization Ability of Next-Token-Prediction Pretraining

  • Zhihao Li
  • Xue Jiang
  • Liyuan Liu
  • Xuelin Zhang
  • Hong Chen*
  • Feng Zheng

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding › Conference proceeding › peer-review

Abstract

Large language models (LLMs) have demonstrated remarkable potential in handling natural language processing (NLP) tasks and beyond. LLMs can usually be categorized as transformer decoder-only models (DOMs), which use Next-Token-Prediction (NTP) as their pre-training methodology. Despite their tremendous empirical successes, a theoretical understanding of how NTP pre-training affects a model's generalization behavior is lacking. To fill this gap, we establish a fine-grained generalization analysis for NTP pre-training based on Rademacher complexity, in which the dependence between tokens is also addressed. Technically, a novel decomposition of Rademacher complexity is developed to study DOMs from the perspectives of the representation learner and the token predictor, respectively. Furthermore, upper bounds on the covering number are established for multi-layer, multi-head transformer-decoder models under the Frobenius norm, which pioneers the theoretical incorporation of the mask matrix within the self-attention mechanism. Our results reveal that the generalization ability of NTP pre-training is affected quantitatively by the number of token sequences N, the maximum sequence length m, and the number of parameters Θ in the transformer model. Additionally, experiments on public datasets verify our theoretical findings. Our code is available at https://github.com/Lizeihao/MININTP.
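For context, the analysis builds on the standard notion of empirical Rademacher complexity. For a function class $\mathcal{F}$ and a sample $S = \{z_1, \dots, z_N\}$, it is defined as

$$\hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}}\!\left[\sup_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} \sigma_i f(z_i)\right],$$

where the $\sigma_i$ are i.i.d. Rademacher variables uniform on $\{-1, +1\}$. The paper's novel decomposition of this quantity, and its treatment of the dependence between tokens, are given in the full text.

The NTP objective and the mask matrix that the bounds account for can be illustrated concretely. The following is a minimal PyTorch sketch of the standard next-token-prediction loss and the additive causal mask used in masked self-attention; it is illustrative only and is not the authors' MININTP implementation (see the repository linked above). The names `ntp_loss` and `causal_mask` are assumed for exposition.

```python
import torch
import torch.nn.functional as F

def causal_mask(seq_len: int) -> torch.Tensor:
    """Additive mask M with M[i, j] = -inf for j > i, so that attention
    at position i can only look at positions <= i."""
    return torch.triu(
        torch.full((seq_len, seq_len), float("-inf")), diagonal=1
    )

def ntp_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Average next-token-prediction loss over a batch of sequences.

    logits: (batch, seq_len, vocab) outputs of a decoder-only model
    tokens: (batch, seq_len) input token ids
    """
    # Position t predicts token t+1: drop the last logit and the first token.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = tokens[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)

# Toy usage with random logits (hypothetical shapes).
batch, seq_len, vocab = 2, 8, 100
logits = torch.randn(batch, seq_len, vocab)
tokens = torch.randint(vocab, (batch, seq_len))
print(ntp_loss(logits, tokens))  # scalar training loss
print(causal_mask(4))            # 4x4 mask with -inf above the diagonal
```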

Original language: English
Title of host publication: Proceedings of the 42nd International Conference on Machine Learning, ICML 2025
Publisher: ML Research Press
Pages: 34943-34975
Number of pages: 33
Publication status: Published - Jul 2025
Event: 42nd International Conference on Machine Learning, ICML 2025 - Vancouver Convention Center, Vancouver, Canada
Duration: 13 Jul 2025 - 19 Jul 2025
https://icml.cc/Conferences/2025 (Conference Website)
https://icml.cc/virtual/2025/calendar (Conference Calendar)
https://proceedings.mlr.press/v267/ (Conference Proceedings)

Publication series

Name: Proceedings of Machine Learning Research
Publisher: ML Research Press
Volume: 267

Conference

Conference: 42nd International Conference on Machine Learning, ICML 2025
Country/Territory: Canada
City: Vancouver
Period: 13/07/25 - 19/07/25