Part-of-Speech (POS) Tagging Enhancement for the Chinese/English Political Interpreting Corpus (CEPIC)

Jun Pan, Fernando Gabarron Barrios, Steven He

Research output: Contribution to conference, Conference paper, peer-review

Abstract

The Chinese/English Political Interpreting Corpus (CEPIC) is an online corpus of 6.5 million word tokens. The corpus features a verbatim transcription of Putonghua, Cantonese and Interpreting speeches/interpretations, which is Part-of-Speech (POS) tagged and annotated with prosodic and paralinguistic features relevant to the study of interpreting and spoken language (Pan 2019).

This paper reports the procedure employed in the CEPIC’s POS tagging and its enhancement. To improve the accuracy of the POS tagging, different taggers were employed and compared during the process. In general, the most problematic part is the POS tagging of the spoken Cantonese part of the corpus, whose accuracy rate remains low. To increase the accuracy of POS tagging for the spoken Cantonese part of the CEPIC, two methods were tested. The first method was to segment with the Jieba segmentation engine, included in SegmentAnt (Anthony 2017), and to POS tag with the Stanford POS tagger. The second method was to segment with the Stanford Word Segmenter and, at the same time, add the POS tags with the Stanford POS tagger. Performing segmentation and POS tagging at the same time was found to be the best option.
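
The abstract does not say how the two pipelines were scored. One common way to compare them, sketched here purely as an assumption, is to score joint segmentation and tagging against a manually checked reference: a predicted token counts as correct only if both its character span and its tag match. The function names, the sample sentence and the tags (Penn Chinese Treebank style) are illustrative and are not taken from the CEPIC project.

```python
from typing import List, Set, Tuple

Tagged = List[Tuple[str, str]]  # sequence of (word, POS tag) pairs


def to_spans(tagged: Tagged) -> Set[Tuple[int, int, str]]:
    """Turn (word, tag) pairs into (start, end, tag) character spans."""
    spans = set()
    offset = 0
    for word, tag in tagged:
        spans.add((offset, offset + len(word), tag))
        offset += len(word)
    return spans


def joint_f1(gold: Tagged, predicted: Tagged) -> float:
    """F1 over spans that match the reference in both boundaries and tag."""
    gold_spans = to_spans(gold)
    pred_spans = to_spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Invented example data (not from the corpus): a hand-checked reference
# and the output of a hypothetical pipeline that merges two words.
gold = [("我", "PN"), ("食", "VV"), ("咗", "AS"), ("飯", "NN")]
pipeline_output = [("我", "PN"), ("食咗", "VV"), ("飯", "NN")]

score = joint_f1(gold, pipeline_output)
```

Scoring on character spans rather than token positions keeps the comparison fair when two pipelines segment the same sentence differently, which is the situation described above.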

However, the issue of low accuracy with spoken Cantonese texts persisted. The proposed solution is to segment and POS tag a large dataset of spoken Cantonese texts, revise as much of it as possible manually with the help of regular expressions, and then train a model for spoken Cantonese. In the CEPIC project, certain manually tagged and checked POS data were employed as a basis to train the Stanford tagger and enhance its performance. We believe that this procedure will shed light on the enhancement of POS tagging for spoken language, in particular spoken Cantonese, in the future.
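
As a rough illustration of regex-assisted revision, the sketch below rewrites the tags of a few Cantonese function words in tagger output. It assumes the output is whitespace-separated `word#TAG` tokens. The rules, the `apply_rules` helper and the target tags (`AS` for the aspect marker, `SP` for sentence-final particles, following Penn Chinese Treebank conventions) are hypothetical examples, not the rules used in the CEPIC project.

```python
import re

# Hypothetical correction rules: each pattern matches one whole
# "word#TAG" token (bounded by whitespace or line edges) and the
# replacement forces the tag assumed to be correct for that word.
RULES = [
    # Perfective aspect marker 咗
    (re.compile(r"(?<!\S)咗#[A-Z]+(?!\S)"), "咗#AS"),
    # A few sentence-final particles
    (re.compile(r"(?<!\S)(啦|喎|囉)#[A-Z]+(?!\S)"), r"\1#SP"),
]


def apply_rules(line: str) -> str:
    """Apply every correction rule to one line of tagged text."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line


corrected = apply_rules("我#PN 食#VV 咗#NN 飯#NN 啦#VV")
```

Rules of this kind only touch tokens whose correct tag is unambiguous, so the remaining cases would still need manual checking before the revised data is used for training.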
Original language: English
Pages: 76
Number of pages: 1
Publication status: Published - 21 Jun 2021
Event: Translation Studies in East Asia: Tradition, Transition, Transcendence, 2021EAST - Online
Duration: 11 Jun 2021 - 12 Jun 2021
http://www.cbs.polyu.edu.hk/2021east/
http://www.cbs.polyu.edu.hk/2021east/doc/2021EAST-Conference-e-booklet.pdf

Conference

Conference: Translation Studies in East Asia: Tradition, Transition, Transcendence, 2021EAST
City: Online
Period: 11/06/21 - 12/06/21

User-Defined Keywords

  • parallel corpus
  • corpora
  • translation studies
  • Chinese-to-English translation
