Abstract
Despite the significant advancements of Large Language Models (LLMs) in recent years, tokenization remains a critical step in Natural Language Processing (NLP) for social scientific research. This study presents a simple but effective approach to enhancing tokenization accuracy when segmenting user-generated content (UGC) in logographic languages such as Chinese. Existing tokenization techniques often struggle to handle the complexities of UGC on digital platforms, which include informal language, slang, and newly coined terms. To address this challenge, we developed a dynamic tokenization model that incorporates cumulative logic to recognize and adapt to evolving linguistic patterns in social media content. By analyzing large online discussion datasets from LIHKG, a Reddit-like forum in Hong Kong, we demonstrate the model's ability to accurately segment domain-specific terms and novel expressions over time. Our results show that the model outperforms traditional tokenizers in recognizing contextually relevant tokens. This approach offers practical advantages for analyzing large-scale UGC data and has the potential to improve the performance of downstream NLP tasks.
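The abstract does not give implementation details, but the core idea of cumulative, dictionary-updating tokenization can be illustrated with a minimal sketch. The example below uses the open-source jieba tokenizer as a stand-in segmenter; the candidate n-gram extraction, the promotion threshold, and the function names are illustrative assumptions, not the authors' method.

```python
# A minimal sketch (not the paper's implementation) of dynamic tokenization with
# cumulative logic: candidate character n-grams from Chinese UGC are counted
# across successive batches, and those whose cumulative frequency crosses a
# threshold are added to the tokenizer's dictionary so later batches segment
# them as single tokens. Threshold and n-gram range are assumed values.
from collections import Counter
import jieba

cumulative_counts = Counter()   # carried forward across batches / time windows
PROMOTE_THRESHOLD = 50          # assumed cutoff for adopting a new term

def extract_candidates(text, min_n=2, max_n=4):
    """Yield contiguous character n-grams as candidate new terms."""
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            yield text[i : i + n]

def update_and_tokenize(batch):
    """Update cumulative counts from a batch of posts, then segment the batch."""
    for post in batch:
        cumulative_counts.update(extract_candidates(post))
    # Promote frequent candidates so the tokenizer treats them as single tokens.
    for term, count in cumulative_counts.items():
        if count >= PROMOTE_THRESHOLD:
            jieba.add_word(term)
    return [jieba.lcut(post) for post in batch]
```

Processing discussion threads in chronological batches with a routine like `update_and_tokenize` lets the dictionary accumulate newly coined terms over time, which is the behavior the abstract attributes to the model.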
| Original language | English |
|---|---|
| Journal | Journal of Computational Social Science |
| Publication status | Accepted/In press - 19 Jun 2025 |
User-Defined Keywords
- Tokenization
- User-generated content
- Logographic languages
- Natural language processing (NLP)
- Dynamic patterns