Easy and Efficient Transformer: Scalable Inference Solution For Large NLP Model

Gongzheng Li, Yadong Xi, Jingzhen Ding, Duan Wang, Ziyang Luo, Rongsheng Zhang, Bai Liu, Changjie Fan, Xiaoxi Mao, Zeng Zhao*

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding › Conference proceeding › peer-review

1 Citation (Scopus)

Abstract

Recently, large-scale transformer-based models have proven effective over various tasks across many domains. Nevertheless, applying them in industrial production requires tedious and heavy work to reduce inference costs. To fill this gap, we introduce a scalable inference solution: Easy and Efficient Transformer (EET), comprising a series of transformer inference optimizations at the algorithm and implementation levels. First, we design highly optimized kernels for long inputs and large hidden sizes. Second, we propose a flexible CUDA memory manager to reduce the memory footprint when deploying a large model. Compared with the state-of-the-art transformer inference library (Faster Transformer v4.0), EET achieves an average 1.40-4.20x speedup on the transformer decoder layer with an A100 GPU.
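The abstract's second contribution, a flexible memory manager that reuses buffers across requests rather than re-allocating them, can be illustrated with a minimal sketch. This is not EET's actual API, only an assumed caching-allocator pattern; `BufferManager`, `acquire`, and `release` are hypothetical names, and a `bytearray` stands in for a `cudaMalloc`'d device buffer.

```python
# Conceptual sketch (assumed design, not EET's real interface): cache
# freed buffers by size and hand them back on later requests, so the
# steady-state memory footprint stops growing after warm-up.

class BufferManager:
    """Caches fixed-size buffers keyed by byte size and reuses them."""

    def __init__(self):
        self._free = {}          # size -> list of reusable buffers
        self.allocations = 0     # counts real (cache-miss) allocations

    def acquire(self, size):
        pool = self._free.get(size)
        if pool:
            return pool.pop()    # reuse a cached buffer: no new allocation
        self.allocations += 1
        return bytearray(size)   # stand-in for a freshly allocated GPU buffer

    def release(self, buf):
        # Return the buffer to the size-keyed free pool for later reuse.
        self._free.setdefault(len(buf), []).append(buf)


mgr = BufferManager()
a = mgr.acquire(1024)   # first request: real allocation
mgr.release(a)
b = mgr.acquire(1024)   # second request of the same size: served from cache
print(mgr.allocations)  # 1 -- two requests, one real allocation
```

The same idea applied to activation and key/value-cache buffers is what lets a serving process hold memory roughly proportional to its peak working set rather than to the number of requests served.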

Original language: English
Title of host publication: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track
Editors: Anastassia Loukina, Rashmi Gangadharaiah, Bonan Min
Publisher: Association for Computational Linguistics (ACL)
Pages: 62-68
Number of pages: 7
ISBN (Electronic): 9781955917728
Publication status: Published - 10 Jul 2022
Event: 2022 Conference of the North American Chapter of the Association for Computational Linguistics, NAACL 2022 - Virtual, Seattle, United States
Duration: 10 Jul 2022 - 15 Jul 2022
https://2022.naacl.org/ (Conference website)
https://2022.naacl.org/downloads/handbook-final-v2.pdf (Conference handbook)
https://aclanthology.org/events/naacl-2022/ (Conference proceedings)
https://aclanthology.org/events/naacl-2022/ (Conference proceedings)

Publication series

Name: Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Industry Papers

Conference

Conference: 2022 Conference of the North American Chapter of the Association for Computational Linguistics, NAACL 2022
Country/Territory: United States
City: Seattle
Period: 10/07/22 - 15/07/22

