NM-SpMM: Accelerating Matrix Multiplication Using N:M Sparsity with GPGPU

Cong Ma, Du Wu, Zhelang Deng, Jiang Chen, Xiaowen Huang, Jintao Meng, Wenxi Zhu, Bingqiang Wang, Amelie Chi Zhou, Peng Chen, Minwen Deng, Yanjie Wei, Shengzhong Feng, Yi Pan

Research output: Chapter in book/report/conference proceeding › Conference proceeding › peer-review

Abstract

Deep learning demonstrates effectiveness across a wide range of tasks. However, the dense and over-parameterized nature of these models results in significant resource consumption during deployment. In response to this issue, weight pruning, particularly through N:M sparsity matrix multiplication, offers an efficient solution by transforming dense operations into semi-sparse ones. N:M sparsity provides a tunable trade-off between performance and model accuracy, but introduces more complex programming and optimization challenges. To address these issues, we design a systematic top-down performance analysis model for N:M sparsity. Building on this analysis, we propose NM-SpMM, an efficient and general N:M sparsity implementation. NM-SpMM employs a hierarchical blocking mechanism as a general optimization to enhance data locality, while memory access optimization and pipeline design serve as sparsity-aware optimizations, allowing it to achieve close-to-theoretical peak performance across different sparsity levels. Experimental results show that NM-SpMM is 2.1× faster than nmSPARSE (the state of the art for general N:M sparsity) and 1.4× to 6.3× faster than cuBLAS's dense GEMM operations, closely approaching the theoretical maximum speedup afforded by the sparsity-induced reduction in computation. NM-SpMM is open source and publicly available at https://github.com/M-H482/NM-SpMM.
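To make the N:M sparsity pattern discussed in the abstract concrete, the following is a minimal, hypothetical sketch (not code from the NM-SpMM repository) of magnitude-based N:M pruning: in every group of M consecutive weights along a row, only the N largest-magnitude entries are kept.

```python
import numpy as np

def prune_n_m(w, n=2, m=4):
    """Prune a weight matrix to N:M sparsity: within each group of M
    consecutive values along a row, keep only the N entries with the
    largest magnitude and zero out the rest."""
    w = np.asarray(w, dtype=float)
    rows, cols = w.shape
    assert cols % m == 0, "column count must be divisible by M"
    groups = w.reshape(rows, cols // m, m)
    # Indices of the (M - N) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=-1)
    return pruned.reshape(rows, cols)

# A 2:4-sparse matrix keeps 2 of every 4 consecutive weights, halving
# both the stored nonzeros and the multiply-accumulate work.
w = np.array([[0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.02, 0.4]])
print(prune_n_m(w, n=2, m=4))
```

Varying N and M is what gives N:M sparsity its flexibility: smaller N/M ratios yield larger theoretical speedups but prune the model more aggressively.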
Original language: English
Title of host publication: 2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Editors: Lisa O'Conner
Place of Publication: Milano
Publisher: IEEE
Pages: 926-937
Number of pages: 12
ISBN (Electronic): 9798331532376
ISBN (Print): 9798331532383
DOIs
Publication status: Published - 7 Jun 2025
Event: 2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS) - Politecnico di Milano, Milano, Italy
Duration: 3 Jun 2025 - 7 Jun 2025
https://www.ipdps.org/ipdps2025/index.html (Conference website)
https://ieeexplore.ieee.org/xpl/conhome/11078457/proceeding (Conference proceeding)
https://www.ipdps.org/ipdps2025/2025-advance-program.html (Conference program)

Publication series

Name: International Symposium on Parallel and Distributed Processing
Publisher: IEEE

Conference

Conference: 2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Abbreviated title: IPDPS 2025
Country/Territory: Italy
City: Milano
Period: 3/06/25 - 7/06/25

User-Defined Keywords

  • N:M sparsity
  • GPU
  • Performance Optimization

