A corpus of Chinese word segmentation agreement

Yiu-Kei Tsang*, Ming Yan, Jinger Pan, Megan Yin Kan Chan

*Corresponding author for this work

Research output: Contribution to journal › Journal article › peer-review

Abstract

The absence of explicit word boundaries is a distinctive characteristic of Chinese script that sets it apart from most alphabetic scripts and leads to disagreement among readers about where word boundaries fall. Previous studies have examined how this feature may influence reading performance. However, further investigations are required to generate more ecologically valid and generalizable findings. To advance our understanding of the impact of word boundaries in Chinese reading, we introduce the Chinese Word Segmentation Agreement (CWSA) corpus. This corpus consists of 500 sentences, comprising 9813 character tokens and 1590 character types, and provides data on word segmentation agreement at each character position. The data revealed a high level of overall segmentation agreement (92%). However, participants disagreed on the position of word boundaries in 8.96% of the cases. Moreover, about 85% of the sentences contained at least one ambiguous word boundary. The character strings with high levels of disagreement were tentatively classified into three categories, namely the morphosyntactic type (e.g., “反映–了”), modifier–head type (e.g., “科學–教育”), and others (e.g., “大力–支持”). Finally, the agreement scores also significantly influenced reading behaviors, as evidenced by analyses with published eye movement data. Specifically, a high level of disagreement was associated with longer single fixation durations. We discuss the implications of these results and highlight how the CWSA corpus can facilitate future research on word segmentation in Chinese reading.

Original language: English
Article number: 25
Number of pages: 15
Journal: Behavior Research Methods
Volume: 57
Issue number: 1
Early online date: 28 Dec 2024
Publication status: Published - Jan 2025

User-Defined Keywords

  • Chinese reading
  • Corpus
  • Eye tracking
  • Word boundary agreement
  • Word segmentation
