Optimizing batched Winograd convolution on GPUs

Da Yan, Wei Wang, Xiaowen CHU

Research output: Chapter in book/report/conference proceeding, Conference contribution, peer-review

2 Citations (Scopus)

Abstract

In this paper, we present an optimized implementation of single-precision Winograd convolution for NVIDIA Volta and Turing GPUs. Compared with the state-of-the-art Winograd convolution in cuDNN 7.6.1, our implementation achieves up to 2.13× speedup on Volta V100 and up to 2.65× speedup on Turing RTX2070. On both Volta and Turing GPUs, our implementation achieves up to 93% of device peak. Apart from analyzing and benchmarking different high-level optimization options, we also build a SASS assembler, TuringAs, for Volta and Turing that enables tuning the performance at the native assembly level. The new optimization opportunities uncovered by TuringAs not only improve the Winograd convolution but can also benefit CUDA compilers and native assembly programming. We have released TuringAs as open-source software. To the best of our knowledge, this is the first publicly available assembler for Volta and Turing GPUs.

Original language: English
Title of host publication: PPoPP 2020 - Proceedings of the 2020 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Publisher: Association for Computing Machinery
Pages: 32-44
Number of pages: 13
ISBN (Electronic): 9781450368186
Publication status: Published - 19 Feb 2020
Event: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2020 - San Diego, United States
Duration: 22 Feb 2020 to 26 Feb 2020

Publication series

Name: Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP

Conference

Conference: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2020
Country/Territory: United States
City: San Diego
Period: 22/02/20 to 26/02/20

Scopus Subject Areas

  • Software

User-Defined Keywords

  • Convolution
  • GPU
  • Performance
