Enhancing tokenization accuracy with dynamic patterns: Cumulative logic for segmenting user-generated content in logographic languages

Yin Zhang*, Zhihuai Lin, Castiel C. Tong, Sam W. Ho

*Corresponding author for this work

Research output: Contribution to journal › Journal article › peer-review

Abstract

Despite the significant advancements of Large Language Models (LLMs) in recent years, tokenization remains a critical step in Natural Language Processing (NLP) for social scientific research. This study presents a simple but effective approach to enhancing tokenization accuracy when segmenting user-generated content (UGC) in logographic languages such as Chinese. Existing tokenization techniques often struggle to handle the complexities of UGC on digital platforms, which include informal language, slang, and newly coined terms. To address this challenge, we developed a dynamic tokenization model that incorporates cumulative logic to recognize and adapt to evolving linguistic patterns in social media content. Analyzing large online discussion datasets from LIHKG, a Reddit-like forum in Hong Kong, we demonstrate the model’s effectiveness through its ability to accurately segment domain-specific terms and novel expressions over time. Our results show that the model outperforms traditional tokenizers in recognizing contextually relevant tokens. This approach offers practical advantages for analyzing large-scale UGC data and has the potential to improve the performance of downstream NLP tasks.
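The abstract does not specify the model's internals, but the idea of cumulative logic adapting a segmenter to novel terms can be illustrated with a minimal sketch: a greedy longest-match tokenizer over a lexicon, where frequent co-occurrences of unknown single characters are accumulated over time and promoted into new lexicon entries once they cross a threshold. The class name, threshold, and promotion rule below are all hypothetical illustrations, not the authors' implementation.

```python
from collections import Counter

class CumulativeTokenizer:
    """Toy dynamic tokenizer (illustrative sketch only).

    Greedy longest-match against a lexicon, plus cumulative counts
    that promote frequently adjacent unknown characters into new
    lexicon entries -- mimicking adaptation to newly coined terms.
    """

    def __init__(self, lexicon, promote_threshold=3, max_ngram=4):
        self.lexicon = set(lexicon)
        self.counts = Counter()            # cumulative pair statistics
        self.threshold = promote_threshold  # promotions need this many sightings
        self.max_ngram = max_ngram

    def tokenize(self, text):
        tokens, i = [], 0
        while i < len(text):
            # Greedy longest match against the current (evolving) lexicon;
            # fall back to a single character when nothing matches.
            for j in range(min(len(text), i + self.max_ngram), i, -1):
                if text[i:j] in self.lexicon or j == i + 1:
                    tokens.append(text[i:j])
                    i = j
                    break
        self._accumulate(tokens)
        return tokens

    def _accumulate(self, tokens):
        # Count adjacent single-character tokens; once a pair is seen
        # often enough, treat it as a newly learned term.
        for a, b in zip(tokens, tokens[1:]):
            if len(a) == 1 and len(b) == 1:
                self.counts[a + b] += 1
                if self.counts[a + b] >= self.threshold:
                    self.lexicon.add(a + b)
```

After enough exposure, a repeated out-of-vocabulary pair such as the forum slang 好膠 stops being split character by character and is emitted as one token, which is the kind of adaptation the abstract describes.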
Original language: English
Journal: Journal of Computational Social Science
Publication status: Accepted/In press - 19 Jun 2025

User-Defined Keywords

  • Tokenization
  • User-generated content
  • Logographic languages
  • Natural language processing (NLP)
  • Dynamic patterns
