TY - JOUR
T1 - Accelerating Multi-Exit BERT Inference via Curriculum Learning and Knowledge Distillation
AU - Gu, Shengwei
AU - Luo, Xiangfeng
AU - Wang, Xinzhi
AU - Guo, Yike
N1 - Publisher Copyright:
© 2023 World Scientific Publishing Company.
PY - 2023/3
Y1 - 2023/3
N2 - The real-time deployment of bidirectional encoder representations from transformers (BERT) is limited by its slow inference, which is caused by its large number of parameters. Recently, the multi-exit architecture has garnered scholarly attention for its ability to achieve a trade-off between performance and efficiency. However, its early exits suffer a considerable performance reduction compared with the final classifier. To accelerate inference with minimal loss of performance, we propose a novel training paradigm for multi-exit BERT that operates at two levels: training samples and intermediate features. Specifically, at the training-sample level, we leverage curriculum learning to guide the training process and improve the generalization capacity of the model. At the intermediate-feature level, we employ layer-wise distillation, learning from shallow to deep layers, to resolve the performance deterioration of the early exits. Experimental results obtained on benchmark datasets for textual entailment and answer selection demonstrate that the proposed training paradigm is effective and achieves state-of-the-art results. Furthermore, layer-wise distillation can completely replace vanilla distillation and delivers superior performance on the textual entailment datasets.
AB - The real-time deployment of bidirectional encoder representations from transformers (BERT) is limited by its slow inference, which is caused by its large number of parameters. Recently, the multi-exit architecture has garnered scholarly attention for its ability to achieve a trade-off between performance and efficiency. However, its early exits suffer a considerable performance reduction compared with the final classifier. To accelerate inference with minimal loss of performance, we propose a novel training paradigm for multi-exit BERT that operates at two levels: training samples and intermediate features. Specifically, at the training-sample level, we leverage curriculum learning to guide the training process and improve the generalization capacity of the model. At the intermediate-feature level, we employ layer-wise distillation, learning from shallow to deep layers, to resolve the performance deterioration of the early exits. Experimental results obtained on benchmark datasets for textual entailment and answer selection demonstrate that the proposed training paradigm is effective and achieves state-of-the-art results. Furthermore, layer-wise distillation can completely replace vanilla distillation and delivers superior performance on the textual entailment datasets.
KW - curriculum learning
KW - knowledge distillation
KW - multi-exit architecture
UR - http://www.scopus.com/inward/record.url?scp=85149988602&partnerID=8YFLogxK
U2 - 10.1142/S0218194023500018
DO - 10.1142/S0218194023500018
M3 - Journal article
AN - SCOPUS:85149988602
SN - 0218-1940
VL - 33
SP - 395
EP - 413
JO - International Journal of Software Engineering and Knowledge Engineering
JF - International Journal of Software Engineering and Knowledge Engineering
IS - 3
ER -