TY - JOUR
T1 - Enhancing tokenization accuracy with dynamic patterns: cumulative logic for segmenting user-generated content in logographic languages
AU - Zhang, Yin
AU - Lin, Zhihuai
AU - Tong, Castiel Chi-chiu
AU - Ho, Sam Wai-yeung
N1 - Open access funding provided by Hong Kong Baptist University Library. This study was supported by the Research Matching Grant Scheme (RMG2021_9_03) of the University Grants Committee, Hong Kong S.A.R.
Publisher copyright:
© The Author(s) 2025
PY - 2025/8
Y1 - 2025/8
N2 - Despite the significant advancements of Large Language Models (LLMs) in recent years, tokenization remains a critical step in Natural Language Processing (NLP) for social scientific research. This study presents a simple but effective approach to enhancing tokenization accuracy in segmenting user-generated content (UGC) in logographic languages, such as Chinese. Existing tokenization techniques often struggle to handle the complexities of UGC on digital platforms, which include informal language, slang, and newly coined terms. To address this challenge, we developed a dynamic tokenization model that incorporates cumulative logic to recognize and adapt to evolving linguistic patterns in social media content. Analyzing large online discussion datasets from LIHKG, a Reddit-like forum in Hong Kong, we demonstrate the model’s effectiveness through its ability to accurately segment domain-specific terms and novel expressions over time. Our results show that the model outperforms traditional tokenizers in recognizing contextually relevant tokens. This innovative approach offers practical advantages for analyzing large-scale UGC data and has the potential to improve the performance of downstream NLP tasks.
AB - Despite the significant advancements of Large Language Models (LLMs) in recent years, tokenization remains a critical step in Natural Language Processing (NLP) for social scientific research. This study presents a simple but effective approach to enhancing tokenization accuracy in segmenting user-generated content (UGC) in logographic languages, such as Chinese. Existing tokenization techniques often struggle to handle the complexities of UGC on digital platforms, which include informal language, slang, and newly coined terms. To address this challenge, we developed a dynamic tokenization model that incorporates cumulative logic to recognize and adapt to evolving linguistic patterns in social media content. Analyzing large online discussion datasets from LIHKG, a Reddit-like forum in Hong Kong, we demonstrate the model’s effectiveness through its ability to accurately segment domain-specific terms and novel expressions over time. Our results show that the model outperforms traditional tokenizers in recognizing contextually relevant tokens. This innovative approach offers practical advantages for analyzing large-scale UGC data and has the potential to improve the performance of downstream NLP tasks.
KW - Dynamic patterns
KW - Logographic languages
KW - Natural language processing (NLP)
KW - Tokenization
KW - User-generated content (UGC)
UR - http://www.scopus.com/inward/record.url?scp=105011284061&partnerID=8YFLogxK
U2 - 10.1007/s42001-025-00406-7
DO - 10.1007/s42001-025-00406-7
M3 - Journal article
SN - 2432-2717
VL - 8
JO - Journal of Computational Social Science
JF - Journal of Computational Social Science
IS - 3
M1 - 80
ER -