Part-of-Speech (POS) tagging interpreting corpora: Methods developed for the Chinese/English Political Interpreting Corpus (CEPIC)

Jun Pan*, Fernando Gabarron Barrios, Haoshen, Steven He, Billy Tak Ming Wong

*Corresponding author for this work

    Research output: Contribution to journal, Journal article, peer-review

    Abstract

    The Chinese/English Political Interpreting Corpus (CEPIC) is an open access corpus with about 6.5 million word tokens. The corpus features verbatim transcriptions of original speeches and interpretations in Putonghua, Cantonese, and English. It is part-of-speech (POS) tagged and annotated with prosodic and paralinguistic features. This paper reports the procedure of POS tagging the CEPIC. The most problematic step was POS tagging verbatim transcriptions of Cantonese, a popular “minority” language with limited computational language resources. Initial trials of existing tools and resources resulted in a low accuracy rate. The situation was complicated by the prosodic and paralinguistic annotations added to the verbatim transcriptions. In the CEPIC project, Stanford CoreNLP 3.9.2 (Manning et al. 2014) was employed. Manually tagged and checked POS data were used as a basis to train the Stanford tagger and enhance its performance. The procedure involved conversion between Traditional and Simplified Chinese, and the development of regular expressions to fix common issues. Methods to further enhance the POS tagging of Cantonese data are discussed at the end of the paper. Apart from contributing to the development of interpreting corpora, the procedure and discussions may shed light on the enhancement of POS tagging for spoken language, in particular “minority” languages.
    Original language: English
    Pages (from-to): 1-45
    Number of pages: 45
    Journal: Translation Quarterly
    Volume: 105
    Publication status: Published - Sept 2022
