Enhancing tokenization accuracy with dynamic patterns: cumulative logic for segmenting user-generated content in logographic languages

Yin Zhang*, Zhihuai Lin, Castiel Chi-chiu Tong, Sam Wai-yeung Ho

*Corresponding author for this work

Research output: Contribution to journal › Journal article › peer-review

Abstract

Despite the significant advancements of Large Language Models (LLMs) in recent years, tokenization remains a critical step in Natural Language Processing (NLP) for social scientific research. This study presents a simple but effective approach to enhancing tokenization accuracy when segmenting user-generated content (UGC) in logographic languages such as Chinese. Existing tokenization techniques often struggle to handle the complexities of UGC on digital platforms, which include informal language, slang, and newly coined terms. To address this challenge, we developed a dynamic tokenization model that incorporates cumulative logic to recognize and adapt to evolving linguistic patterns in social media content. Analyzing large online discussion datasets from LIHKG, a Reddit-like forum in Hong Kong, we demonstrate the model's effectiveness in accurately segmenting domain-specific terms and novel expressions over time. Our results show that the model outperforms traditional tokenizers in recognizing contextually relevant tokens. This innovative approach offers practical advantages for analyzing large-scale UGC data and has the potential to improve the performance of downstream NLP tasks.
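
The abstract describes the "cumulative logic" mechanism only at a high level. As a rough, hypothetical sketch of what such dynamic segmentation might look like in practice, the Python snippet below accumulates character n-gram counts across batches of posts and promotes n-grams that cross a frequency threshold into a lexicon used for greedy longest-match segmentation. The class name, thresholds, and promotion rule here are illustrative assumptions, not the authors' published method.

    from collections import Counter

    class CumulativeTokenizer:
        """Greedy longest-match segmenter whose lexicon grows as new
        high-frequency character n-grams accumulate across batches."""

        def __init__(self, base_lexicon, min_count=5, max_ngram=4):
            self.lexicon = set(base_lexicon)   # known terms
            self.counts = Counter()            # cumulative n-gram counts
            self.min_count = min_count         # promotion threshold
            self.max_ngram = max_ngram         # longest candidate term

        def observe(self, text):
            """Accumulate n-gram counts over incoming posts; promote
            n-grams that cross the threshold into the lexicon."""
            for n in range(2, self.max_ngram + 1):
                for i in range(len(text) - n + 1):
                    gram = text[i:i + n]
                    self.counts[gram] += 1
                    if self.counts[gram] >= self.min_count:
                        self.lexicon.add(gram)

        def tokenize(self, text):
            """Segment by greedy longest match against the current
            lexicon; unknown characters fall back to single tokens."""
            tokens, i = [], 0
            while i < len(text):
                for n in range(self.max_ngram, 0, -1):
                    cand = text[i:i + n]
                    if n == 1 or cand in self.lexicon:
                        tokens.append(cand)
                        i += n
                        break
            return tokens

    # Illustrative usage (hypothetical data): once the coined term
    # "連登仔" (roughly, "LIHKG user") has been observed min_count
    # times, it segments as a single token rather than character by
    # character.
    tok = CumulativeTokenizer(base_lexicon={"香港"}, min_count=3)
    for post in ["連登仔好多", "連登仔返嚟", "連登仔發帖"]:
        tok.observe(post)
    print(tok.tokenize("連登仔好勁"))  # ['連登仔', '好', '勁']

The design point this toy version captures is that counts persist across batches, so the tokenizer adapts to newly coined terms over time without retraining; the published model presumably adds safeguards (e.g. handling overlapping candidates) that are out of scope here.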
Original language: English
Article number: 80
Number of pages: 24
Journal: Journal of Computational Social Science
Volume: 8
Issue number: 3
Early online date: 23 Jul 2025
DOIs
Publication status: Published - Aug 2025

User-Defined Keywords

  • Dynamic patterns
  • Logographic languages
  • Natural language processing (NLP)
  • Tokenization
  • User-generated content (UGC)
