The Chinese/English Political Interpreting Corpus (CEPIC) is an open-access corpus of about 6.5 million word tokens. The corpus features verbatim transcriptions of original speeches and interpretations in Putonghua, Cantonese, and English. It is part-of-speech (POS) tagged and annotated with prosodic and paralinguistic features. This paper reports the procedure used to POS tag the CEPIC. The most problematic step was POS tagging verbatim transcriptions of Cantonese, a popular “minority” language with limited computational language resources. Initial trials of existing tools and resources yielded a low accuracy rate. The situation was complicated by the prosodic and paralinguistic annotations added to the verbatim transcriptions. In the CEPIC project, Stanford CoreNLP 3.9.2 (Manning et al. 2014) was employed. Manually tagged and checked POS data were used as a basis to train the Stanford tagger and enhance its performance. The procedure involved conversion between Traditional and Simplified Chinese and the development of regular expressions to fix common issues. Methods to further enhance the POS tagging of Cantonese data are discussed at the end of the paper. Apart from contributing to the development of interpreting corpora, the procedure and discussion may shed light on enhancing POS tagging for spoken language, in particular for “minority” languages.
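One way to keep prosodic and paralinguistic annotations from interfering with a POS tagger is to set them aside before tagging and reinsert them afterwards. The following is a minimal sketch of that idea only, not the CEPIC procedure itself: the angle-bracket marker format (e.g. `<pause>`), the `ANNOT` placeholder tag, and both function names are assumptions made for illustration, and the tagger call is left out.

```python
import re

# Assumption: annotations are single tokens in angle brackets, e.g. <pause>.
# The actual CEPIC annotation scheme is not described in this abstract.
ANNOTATION = re.compile(r"<[^<>]+>")


def split_annotations(tokens):
    """Separate annotation markers from word tokens, remembering marker positions."""
    words = []
    markers = {}
    for index, token in enumerate(tokens):
        if ANNOTATION.fullmatch(token):
            markers[index] = token
        else:
            words.append(token)
    return words, markers


def merge_annotations(tagged_words, markers, total, marker_tag="ANNOT"):
    """Reinsert markers at their original positions with a placeholder tag."""
    remaining = iter(tagged_words)
    return [
        (markers[i], marker_tag) if i in markers else next(remaining)
        for i in range(total)
    ]
```

In use, only the `words` list would be passed to the tagger, and its `(word, tag)` output would then be given to `merge_annotations` together with the saved `markers` and the original token count.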
|Published - Sept 2022