Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply

Da Yan, Wei Wang, Xiaowen CHU

Research output: Chapter in book/report/conference proceedingConference contributionpeer-review

5 Citations (Scopus)

Abstract

Half-precision matrix multiply has played a key role in the training of deep learning models. The newly designed Nvidia Tensor Cores offer the native instructions for half-precision small matrix multiply, based on which Half-precision General Matrix Multiply (HGEMM) routines are developed and can be accessed through high-level APIs. In this paper, we, for the first time, demystify how Tensor Cores on NVIDIA Turing architecture work in great details, including the instructions used, the registers and data layout required, as well as the throughput and latency of Tensor Core operations. We further benchmark the memory system of Turing GPUs and conduct quantitative analysis of the performance. Our analysis shows that the bandwidth of DRAM, L2 cache and shared memory is the new bottleneck for HGEMM, whose performance is previously believed to be bound by computation. Based on our newly discovered features of Tensor Cores, we apply a series of optimization techniques on the Tensor Core-based HGEMM, including blocking size optimization, data layout redesign, data prefetching, and instruction scheduling. Extensive evaluation results show that our optimized HGEMM routine achieves an average of 1.73× and 1.46× speedup over the native implementation of cuBLAS 10.1 on NVIDIA Turing RTX2070 and T4 GPUs, respectively. The code of our implementation is written in native hardware assembly (SASS).

Original languageEnglish
Title of host publicationProceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium, IPDPS 2020
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages634-643
Number of pages10
ISBN (Electronic)9781728168760
DOIs
Publication statusPublished - May 2020
Event34th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2020 - New Orleans, United States
Duration: 18 May 202022 May 2020

Publication series

NameProceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium, IPDPS 2020

Conference

Conference34th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2020
Country/TerritoryUnited States
CityNew Orleans
Period18/05/2022/05/20

Scopus Subject Areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Safety, Risk, Reliability and Quality

User-Defined Keywords

  • GEMM
  • GPU
  • Half-precision
  • Tensor Core

Fingerprint

Dive into the research topics of 'Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply'. Together they form a unique fingerprint.

Cite this