TY - JOUR
T1 - Enhancement of English-Bengali Machine Translation Leveraging Back-Translation
AU - Mondal, Subrota Kumar
AU - Wang, Chengwei
AU - Chen, Yijun
AU - Cheng, Yuning
AU - Huang, Yanbo
AU - Dai, Hong-Ning
AU - Kabir, H. M. Dipu
N1 - This work was supported by the Science and Technology Development Fund of Macao, Macao SAR, China, under grant 0033/2022/ITP.
Publisher Copyright:
© 2024 by the authors.
PY - 2024/08/05
Y1 - 2024/08/05
N2 - An English-Bengali machine translation (MT) application converts an English text into a corresponding Bengali translation. MT for resource-rich language pairs, such as English-German, has been studied for decades; however, MT for language pairs that lack large parallel corpora remains challenging. In this study, we employ back-translation to improve translation accuracy. Back-translation produces a pseudo-parallel corpus, and the generated (pseudo) corpus can be added to the original dataset to obtain an augmented training set. However, the new data can be regarded as noisy, because they are generated by models that may be neither well trained nor as carefully vetted as human translators. Since the output of a translation model is a probability distribution over candidate words, different decoding methods can be used to make the model more robust, such as beam search, top-k random sampling, and random sampling with temperature T (in which the predicted distribution is sharpened or flattened by scaling the logits by 1/T before the softmax). Notably, for generating back-translated data, top-k random sampling and random sampling with temperature T are more commonly used and more effective than beam search. To this end, our study compares LSTM (Long Short-Term Memory, as a baseline) and the Transformer. Our results show that the Transformer (BLEU 27.80 on validation, 1.33 on test) outperforms the LSTM (3.62 on validation, 0.00 on test) by a large margin on the English-Bengali translation task. (Evaluating LSTM and Transformer without any augmented data constitutes our baseline study.) We also incorporate the two decoding methods, top-k random sampling and random sampling with temperature T, into back-translation to improve translation accuracy. The results show that data generated by back-translation without top-k or temperature sampling (“no strategy”) improve accuracy (BLEU 38.22, +10.42 on validation; 2.07, +0.74 on test). Back-translation with top-k sampling is less effective (k = 10: BLEU 29.43, +1.83 on validation; 1.36, +0.03 on test), whereas random sampling with a suitable temperature yields a higher score (T = 0.5: BLEU 35.02, +7.22 on validation; 2.35, +1.02 on test). This implies that in English-Bengali MT, the training set can be effectively augmented through back-translation using random sampling with a proper temperature T.
KW - English
KW - Bengali
KW - machine translation
KW - back-translation
KW - random sampling
KW - temperature
UR - http://www.scopus.com/inward/record.url?scp=85200845749&partnerID=8YFLogxK
U2 - 10.3390/app14156848
DO - 10.3390/app14156848
M3 - Journal article
AN - SCOPUS:85200845749
SN - 2076-3417
VL - 14
JO - Applied Sciences (Switzerland)
JF - Applied Sciences (Switzerland)
IS - 15
M1 - 6848
ER -