MKD: Mixup-Based Knowledge Distillation for Mandarin End-to-End Speech Recognition

Xing Wu*, Yifan Jin, Jianjia Wang, Quan Qian, Yike Guo

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Large-scale automatic speech recognition (ASR) models have achieved impressive performance. However, training an ASR model requires huge computational resources and a massive amount of data. Knowledge distillation is a prevalent model compression method that transfers knowledge from a large model to a small model. To improve the efficiency of knowledge distillation for end-to-end speech recognition, especially in the low-resource setting, a Mixup-based Knowledge Distillation (MKD) method is proposed that combines Mixup, a data-agnostic data augmentation method, with softmax-level knowledge distillation. A loss-level mixture is presented to address the problem caused by the non-linearity of the label in the KL-divergence when applying Mixup to the teacher–student framework. It is mathematically shown that optimizing the mixture of loss functions is equivalent to optimizing an upper bound of the original knowledge distillation loss. The proposed MKD takes advantage of Mixup and brings robustness to the model even with a small amount of training data. Experiments on Aishell-1 show that MKD obtains 15.6% and 3.3% relative improvements on two student models with different parameter scales compared with existing methods. Experiments on data efficiency demonstrate that MKD achieves similar results with only half of the original dataset.
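
The loss-level mixture described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no formulas, so the sketch assumes the standard Mixup input interpolation and assumes the "original" distillation loss is the KL-divergence from the linearly mixed teacher distribution to the student output. All function names here (`mixup`, `loss_level_mixture`, `label_level_mixture`) are hypothetical, and plain Python lists stand in for model outputs. Under these assumptions, convexity of the KL-divergence in its first argument means the mixed loss is never smaller than the loss on the mixed label.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mixup(x_a, x_b, lam):
    """Standard Mixup on input features: lam * x_a + (1 - lam) * x_b."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]


def loss_level_mixture(teacher_a, teacher_b, student_mixed, lam):
    """Assumed form: mix the two distillation losses, not the soft labels."""
    return (lam * kl_divergence(teacher_a, student_mixed)
            + (1.0 - lam) * kl_divergence(teacher_b, student_mixed))


def label_level_mixture(teacher_a, teacher_b, student_mixed, lam):
    """Assumed baseline: KL from the linearly mixed teacher distribution."""
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(teacher_a, teacher_b)]
    return kl_divergence(mixed, student_mixed)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical teacher posteriors for two utterances, and a student
    # posterior for the mixed input (random logits stand in for real models).
    teacher_a = softmax([rng.gauss(0, 1) for _ in range(5)])
    teacher_b = softmax([rng.gauss(0, 1) for _ in range(5)])
    student = softmax([rng.gauss(0, 1) for _ in range(5)])
    lam = rng.betavariate(0.5, 0.5)  # Mixup commonly draws lam from Beta(a, a)
    print("loss-level :", loss_level_mixture(teacher_a, teacher_b, student, lam))
    print("label-level:", label_level_mixture(teacher_a, teacher_b, student, lam))
```

In a real training loop the student posterior would come from running the student on `mixup(x_a, x_b, lam)`, and the two teacher posteriors from running the teacher on the unmixed inputs; the Beta parameter `0.5` is an arbitrary illustrative value, not one taken from the paper.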

Original language: English
Article number: 160
Journal: Algorithms
Volume: 15
Issue number: 5
Publication status: Published - 11 May 2022

Scopus Subject Areas

  • Theoretical Computer Science
  • Numerical Analysis
  • Computational Theory and Mathematics
  • Computational Mathematics

User-Defined Keywords

  • data efficiency
  • end-to-end speech recognition
  • knowledge distillation
  • mixup
  • model compression
