Part-of-Speech (POS) Tagging Enhancement for the Chinese/English Political Interpreting Corpus (CEPIC)

Jun Pan, Fernando Gabarron Barrios, Steven He

    Research output: Contribution to conference › Conference paper › peer-review

    Abstract

    The Chinese/English Political Interpreting Corpus (CEPIC) is an online corpus of 6.5 million word tokens. The corpus features a verbatim transcription of Putonghua, Cantonese and Interpreting speeches/interpretations, which is Part-of-Speech (POS) tagged and annotated with prosodic and paralinguistic features that are of concern to the study of interpreting and spoken language (Pan 2019).

    This paper reports the procedure used for the POS tagging of the CEPIC and for its enhancement. To improve the accuracy of the POS tagging, different taggers were employed and compared during the process. The most problematic component was the spoken Cantonese part of the corpus, for which accuracy remained low. Two methods were tested to raise the tagging accuracy for this part. The first was to segment the text with the Jieba segmentation engine, included in SegmentAnt (Anthony 2017), and then to POS tag it with the Stanford POS tagger. The second was to segment with the Stanford Word Segmenter and to add the POS tags with the Stanford POS tagger at the same time. Performing segmentation and POS tagging at the same time was found to be the better option.
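One way to compare two segmentation-plus-tagging pipelines is to score each against a manually checked sample. The sketch below is not the authors' evaluation procedure. It is a minimal illustration that assumes each output is available as a list of (word, tag) pairs over the same text, and it counts a token as correct only when both its character span and its tag match the gold standard. The function names, the sample sentence and the Penn Chinese Treebank style tags are all illustrative assumptions.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word, tag); the layout is an assumption for this sketch


def to_spans(tokens: List[Token]) -> Set[Tuple[int, int, str]]:
    """Convert (word, tag) pairs into (start, end, tag) character spans."""
    spans = set()
    pos = 0
    for word, tag in tokens:
        spans.add((pos, pos + len(word), tag))
        pos += len(word)
    return spans


def joint_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over tokens whose span AND tag both match."""
    gold_spans = to_spans(gold)
    pred_spans = to_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold annotation and two hypothetical pipeline outputs.
gold = [("我", "PN"), ("食", "VV"), ("咗", "AS"), ("飯", "NN")]
pipeline_a = [("我", "PN"), ("食咗", "VV"), ("飯", "NN")]              # segmentation error
pipeline_b = [("我", "PN"), ("食", "VV"), ("咗", "SP"), ("飯", "NN")]  # tagging error

if __name__ == "__main__":
    for name, output in (("A", pipeline_a), ("B", pipeline_b)):
        p, r, f = joint_scores(gold, output)
        print(f"pipeline {name}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Scoring on character spans rather than token positions keeps the comparison fair when the two pipelines segment the same text differently, as a segmentation error would otherwise shift every later token out of alignment.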

    However, the low accuracy for spoken Cantonese texts persisted. The proposed solution is to segment and POS tag a large dataset of spoken Cantonese texts, revise the output manually as far as possible with the help of regular expressions, and then train a model for spoken Cantonese. In the CEPIC project, some manually tagged and checked POS data were used as a basis for training the Stanford tagger and enhancing its performance. We believe that this procedure will shed light on the enhancement of POS tagging for spoken language, in particular spoken Cantonese, in the future.
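The regular-expression-assisted revision step described above could look roughly like the following. This is a hedged sketch, not the rules used in the CEPIC project. It assumes tagger output in a `word#TAG` format with tokens separated by spaces, and the two correction rules (the aspect marker 咗 as `AS`, the sentence-final particle 啦 as `SP`) are illustrative assumptions that a linguist would need to confirm before applying them to real data. Every change is logged so that it can be checked manually.

```python
import re
from typing import List, Tuple

# Illustrative assumption: words whose tag is treated as fixed in this sketch.
FIXED_TAGS = {"咗": "AS", "啦": "SP"}

# Match a whole token "word#TAG" whose word is one of the listed forms.
# (?<!\S) requires the token to start the line or follow whitespace.
TOKEN_RE = re.compile(
    r"(?<!\S)(" + "|".join(map(re.escape, FIXED_TAGS)) + r")#(\S+)"
)


def revise(line: str) -> Tuple[str, List[Tuple[str, str, str]]]:
    """Return the revised line and a log of (word, old_tag, new_tag) changes."""
    changes: List[Tuple[str, str, str]] = []

    def fix(match: "re.Match[str]") -> str:
        word, old_tag = match.group(1), match.group(2)
        new_tag = FIXED_TAGS[word]
        if old_tag != new_tag:
            changes.append((word, old_tag, new_tag))
        return f"{word}#{new_tag}"

    return TOKEN_RE.sub(fix, line), changes


if __name__ == "__main__":
    revised, log = revise("我#PN 食#VV 咗#SP 飯#NN")
    print(revised)
    print(log)
```

The revised lines, once checked, would then serve as training data for a retrained tagger model. The training step itself depends on the tagger's own tooling and is not shown here.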
    Original language: English
    Pages: 76
    Number of pages: 1
    Publication status: Published - 21 Jun 2021
    Event: Translation Studies in East Asia: Tradition, Transition, Transcendence, 2021EAST - Online
    Duration: 11 Jun 2021 - 12 Jun 2021
    http://www.cbs.polyu.edu.hk/2021east/
    http://www.cbs.polyu.edu.hk/2021east/doc/2021EAST-Conference-e-booklet.pdf

    Conference

    Conference: Translation Studies in East Asia: Tradition, Transition, Transcendence, 2021EAST
    City: Online
    Period: 11/06/21 - 12/06/21

    User-Defined Keywords

    • parallel corpus
    • corpora
    • translation studies
    • Chinese-to-English translation
