Abstract
The Chinese/English Political Interpreting Corpus (CEPIC) is an online corpus of 6.5 million word tokens. It features verbatim transcriptions of Putonghua, Cantonese and Interpreting speeches/interpretations, which are part-of-speech (POS) tagged and annotated with the prosodic and paralinguistic features relevant to the study of interpreting and spoken language (Pan 2019).
This paper reports the procedure used for POS tagging in the CEPIC and the steps taken to enhance it. To improve tagging accuracy, different taggers were employed and compared during the process. The most problematic part was the spoken Cantonese section of the corpus, for which tagging accuracy remains low. To raise accuracy for this section, two methods were tested. The first was to segment the text with the Jieba segmentation engine, included in SegmentAnt (Anthony 2017), and then to POS tag it with the Stanford POS Tagger. The second was to segment with the Stanford Word Segmenter and add POS tags with the Stanford POS Tagger at the same time. Performing segmentation and POS tagging at the same time was found to be the better option.
However, low accuracy on spoken Cantonese texts persisted. The proposed solution is to segment and POS tag a large dataset of spoken Cantonese texts, revise as much of it as possible manually with the help of regular expressions, and then train a model for spoken Cantonese. In the CEPIC project, some manually tagged and checked POS data were used as the basis for training the Stanford tagger and improving its performance. We believe that this procedure will shed light on how to improve POS tagging for spoken language, and for spoken Cantonese in particular.
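The regular-expression-assisted revision step can be illustrated with a minimal sketch. It is not the rule set used in the CEPIC project: the `word#TAG` token format, the two correction rules, the tag names and the example sentence are all assumptions made up for illustration. It shows how a small list of rules could be applied to tagger output and how accuracy against a manually checked version could be measured before and after.

```python
import re

def parse_tagged(line, sep="#"):
    """Split a tagged line such as '我#PN 嘅#DEG' into (word, tag) pairs.
    The '#' separator is an assumption about the tagger's output format."""
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs

def apply_rules(line, rules):
    """Apply each (compiled pattern, replacement) rule to the tagged line in order."""
    for pattern, replacement in rules:
        line = pattern.sub(replacement, line)
    return line

def accuracy(predicted, gold):
    """Token-level tag accuracy; both inputs must share the same segmentation."""
    if len(predicted) != len(gold):
        raise ValueError("segmentations differ; cannot compare tag by tag")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Hypothetical rules: retag two Cantonese function words that a tagger
# trained on written Chinese might label as nouns. Illustrative only.
RULES = [
    (re.compile(r"(?<!\S)嘅#NN(?!\S)"), "嘅#DEG"),
    (re.compile(r"(?<!\S)喺#NN(?!\S)"), "喺#P"),
]

# Invented tagger output and its manually checked counterpart.
tagger_output = "我#PN 嘅#NN 意見#NN 喺#NN 呢度#PN"
gold = "我#PN 嘅#DEG 意見#NN 喺#P 呢度#PN"

revised = apply_rules(tagger_output, RULES)
before = accuracy(parse_tagged(tagger_output), parse_tagged(gold))
after = accuracy(parse_tagged(revised), parse_tagged(gold))
print(f"accuracy before rules: {before:.2f}, after rules: {after:.2f}")
```

In a workflow of this kind, the revised output would still be checked by hand before being used as training data, since a rule keyed only on the word form and its current tag can also overwrite tags that were correct.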
Original language | English |
---|---|
Pages | 76 |
Number of pages | 1 |
Publication status | Published - 21 Jun 2021 |
Event | Translation Studies in East Asia: Tradition, Transition, Transcendence, 2021EAST - Online Duration: 11 Jun 2021 → 12 Jun 2021 http://www.cbs.polyu.edu.hk/2021east/ http://www.cbs.polyu.edu.hk/2021east/doc/2021EAST-Conference-e-booklet.pdf |
Conference
Conference | Translation Studies in East Asia: Tradition, Transition, Transcendence, 2021EAST |
---|---|
City | Online |
Period | 11/06/21 → 12/06/21 |
User-Defined Keywords
- parallel corpus
- corpora
- translation studies
- Chinese-to-English translation