TY - GEN
T1 - Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply
AU - Yan, Da
AU - Wang, Wei
AU - Chu, Xiaowen
N1 - Funding information:
This work was supported in part by Hong Kong RGC ECS grant 26213818 and GRF grant HKBU 12200418.
Publisher copyright:
© 2020 IEEE
PY - 2020/5
Y1 - 2020/5
N2 - Half-precision matrix multiply has played a key role in the training of deep learning models. NVIDIA's newly designed Tensor Cores offer native instructions for small half-precision matrix multiplies, on top of which Half-precision General Matrix Multiply (HGEMM) routines are built and exposed through high-level APIs. In this paper, we demystify, for the first time, how Tensor Cores on the NVIDIA Turing architecture work in great detail, including the instructions used, the registers and data layout required, and the throughput and latency of Tensor Core operations. We further benchmark the memory system of Turing GPUs and conduct a quantitative performance analysis. Our analysis shows that the bandwidth of DRAM, the L2 cache, and shared memory is the new bottleneck for HGEMM, whose performance was previously believed to be bound by computation. Based on the newly discovered features of Tensor Cores, we apply a series of optimization techniques to Tensor Core-based HGEMM, including blocking size optimization, data layout redesign, data prefetching, and instruction scheduling. Extensive evaluation results show that our optimized HGEMM routine achieves average speedups of 1.73× and 1.46× over the native implementation in cuBLAS 10.1 on NVIDIA Turing RTX2070 and T4 GPUs, respectively. Our implementation is written in native hardware assembly (SASS).
KW - GEMM
KW - GPU
KW - Half-precision
KW - Tensor Core
UR - http://www.scopus.com/inward/record.url?scp=85088904264&partnerID=8YFLogxK
U2 - 10.1109/IPDPS47924.2020.00071
DO - 10.1109/IPDPS47924.2020.00071
M3 - Conference proceeding
AN - SCOPUS:85088904264
T3 - Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium, IPDPS 2020
SP - 634
EP - 643
BT - Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium, IPDPS 2020
PB - IEEE
T2 - 34th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2020
Y2 - 18 May 2020 through 22 May 2020
ER -