TY - JOUR
T1 - On English-Chinese Neural Machine Translation leveraging Transformer model
AU - Mondal, Subrota Kumar
AU - Chen, Yijun
AU - Cheng, Yuning
AU - Dai, Hong-Ning
AU - Alam, Syed B.
AU - Kabir, H.M. Dipu
N1 - Publisher Copyright:
© 2025 The Author(s).
Funding Information:
This work was supported by The Science and Technology Development Fund of Macao, Macao SAR, China under grant 0033/2022/ITP.
PY - 2025/9
Y1 - 2025/9
N2 - In today’s era of globalization, cross-cultural communication has become increasingly frequent, and photo translation (photo, image, or scene-text translation) technology has become an important tool. With this technology, people can easily recognize and translate text in other languages without manual input or transcription, which has practical value in fields such as tourism, business, education, and research. Photo translation has therefore become an indispensable tool, bringing convenience to people’s lives and work. To this end, this paper aims to achieve high-accuracy English-to-Chinese photo translation, which can be divided into three stages: text detection, text recognition, and text translation (i.e., machine translation). We observe that text detection and recognition face challenges with occluded text, handwritten text, scene text, text with complex layouts, distorted text, and many others; in this paper, however, we limit our analysis to the translation phase. For the detection and recognition phases, we make use of current state-of-the-art methodologies: the DBNet model (Liao et al., 2020) for detection and the ABINet model (Fang et al., 2021) for recognition. For translation, we use the Transformer model with modifications aimed at improving translation accuracy. The modifications concern two aspects: data preprocessing and the optimizer. In data preprocessing, we use the BPE (Byte Pair Encoding) algorithm instead of basic word-centered tokenization; in this context, BPE divides words into smaller subwords, which mitigates the rare-word problem to some extent and provides better word vectors for language-model training. For the optimizer, we use Lion, proposed by Google, instead of the widely used Adam optimizer; Lion reduces the loss more quickly than Adam at small batch sizes, and with batch size 256 it achieves the lowest test loss of 0.392842 (−1.072171) and the highest BLEU-4 score of 0.381281 (+0.24063). This helps reduce training-resource consumption and improves the sustainability of deep learning.
AB - In today’s era of globalization, cross-cultural communication has become increasingly frequent, and photo translation (photo, image, or scene-text translation) technology has become an important tool. With this technology, people can easily recognize and translate text in other languages without manual input or transcription, which has practical value in fields such as tourism, business, education, and research. Photo translation has therefore become an indispensable tool, bringing convenience to people’s lives and work. To this end, this paper aims to achieve high-accuracy English-to-Chinese photo translation, which can be divided into three stages: text detection, text recognition, and text translation (i.e., machine translation). We observe that text detection and recognition face challenges with occluded text, handwritten text, scene text, text with complex layouts, distorted text, and many others; in this paper, however, we limit our analysis to the translation phase. For the detection and recognition phases, we make use of current state-of-the-art methodologies: the DBNet model (Liao et al., 2020) for detection and the ABINet model (Fang et al., 2021) for recognition. For translation, we use the Transformer model with modifications aimed at improving translation accuracy. The modifications concern two aspects: data preprocessing and the optimizer. In data preprocessing, we use the BPE (Byte Pair Encoding) algorithm instead of basic word-centered tokenization; in this context, BPE divides words into smaller subwords, which mitigates the rare-word problem to some extent and provides better word vectors for language-model training. For the optimizer, we use Lion, proposed by Google, instead of the widely used Adam optimizer; Lion reduces the loss more quickly than Adam at small batch sizes, and with batch size 256 it achieves the lowest test loss of 0.392842 (−1.072171) and the highest BLEU-4 score of 0.381281 (+0.24063). This helps reduce training-resource consumption and improves the sustainability of deep learning.
KW - Chinese
KW - English
KW - Image text
KW - Neural Machine Translation
KW - Optical character recognition
UR - https://www.scopus.com/pages/publications/105022095957
U2 - 10.1016/j.nlp.2025.100166
DO - 10.1016/j.nlp.2025.100166
M3 - Journal article
SN - 2949-7191
VL - 12
JO - Natural Language Processing Journal
JF - Natural Language Processing Journal
M1 - 100166
ER -