TY - GEN
T1 - Self-attention-based fully-inception networks for continuous sign language recognition
AU - Zhou, Mingjie
AU - Ng, Michael
AU - Cai, Zixin
AU - Cheung, Ka Chun
N1 - Publisher Copyright:
© 2020 The authors and IOS Press.
PY - 2020/8/24
Y1 - 2020/8/24
N2 - In the hearing-loss community, sign language is a primary tool for communication, yet a communication gap remains between people with hearing loss and hearing people. Continuous sign language recognition, which can help bridge this gap, is a challenging task because of the weakly supervised ordered annotations, in which no frame-level labels are provided. To address this problem, connectionist temporal classification (CTC) is the most widely used method. However, CTC learning can perform poorly if the extracted features are weak. For better feature extraction, this work presents novel self-attention-based fully-inception (SAFI) networks for vision-based end-to-end continuous sign language recognition. Considering that the lengths of sign words differ from one another, we introduce a fully-inception network with different receptive fields to extract dynamic clip-level features. To further boost performance, the fully-inception network with an auxiliary classifier is trained with the aggregation cross-entropy (ACE) loss. Self-attention networks, serving as a global sequential feature extractor, are then used to model the clip-level features with CTC. The proposed model is optimized by jointly training with ACE on clip-level feature learning and CTC on global sequential feature learning in an end-to-end fashion. The best baseline method achieves 35.6% WER on the validation set and 34.5% WER on the test set; it employs a better decoding algorithm for pseudo labels in an EM-like optimization to fine-tune the CNN module. In contrast, our approach focuses on better feature extraction for end-to-end learning. To alleviate overfitting on the limited dataset, we employ temporal elastic deformation to triple the size of the real-world RWTH-PHOENIX-Weather 2014 dataset. Experimental results on RWTH-PHOENIX-Weather 2014 demonstrate the effectiveness of our approach, which achieves 31.7% WER on the validation set and 31.3% WER on the test set.
AB - In the hearing-loss community, sign language is a primary tool for communication, yet a communication gap remains between people with hearing loss and hearing people. Continuous sign language recognition, which can help bridge this gap, is a challenging task because of the weakly supervised ordered annotations, in which no frame-level labels are provided. To address this problem, connectionist temporal classification (CTC) is the most widely used method. However, CTC learning can perform poorly if the extracted features are weak. For better feature extraction, this work presents novel self-attention-based fully-inception (SAFI) networks for vision-based end-to-end continuous sign language recognition. Considering that the lengths of sign words differ from one another, we introduce a fully-inception network with different receptive fields to extract dynamic clip-level features. To further boost performance, the fully-inception network with an auxiliary classifier is trained with the aggregation cross-entropy (ACE) loss. Self-attention networks, serving as a global sequential feature extractor, are then used to model the clip-level features with CTC. The proposed model is optimized by jointly training with ACE on clip-level feature learning and CTC on global sequential feature learning in an end-to-end fashion. The best baseline method achieves 35.6% WER on the validation set and 34.5% WER on the test set; it employs a better decoding algorithm for pseudo labels in an EM-like optimization to fine-tune the CNN module. In contrast, our approach focuses on better feature extraction for end-to-end learning. To alleviate overfitting on the limited dataset, we employ temporal elastic deformation to triple the size of the real-world RWTH-PHOENIX-Weather 2014 dataset. Experimental results on RWTH-PHOENIX-Weather 2014 demonstrate the effectiveness of our approach, which achieves 31.7% WER on the validation set and 31.3% WER on the test set.
UR - http://www.scopus.com/inward/record.url?scp=85091779592&partnerID=8YFLogxK
U2 - 10.3233/FAIA200425
DO - 10.3233/FAIA200425
M3 - Conference proceeding
AN - SCOPUS:85091779592
SN - 9781643681009
T3 - Frontiers in Artificial Intelligence and Applications
SP - 2832
EP - 2839
BT - ECAI 2020 - 24th European Conference on Artificial Intelligence, including 10th Conference on Prestigious Applications of Artificial Intelligence, PAIS 2020 - Proceedings
A2 - De Giacomo, Giuseppe
A2 - Catala, Alejandro
A2 - Dilkina, Bistra
A2 - Milano, Michela
A2 - Barro, Senen
A2 - Bugarin, Alberto
A2 - Lang, Jerome
PB - IOS Press BV
T2 - 24th European Conference on Artificial Intelligence, ECAI 2020, including 10th Conference on Prestigious Applications of Artificial Intelligence, PAIS 2020
Y2 - 29 August 2020 through 8 September 2020
ER -