TY - GEN
T1 - Efficient sparse-dense matrix-matrix multiplication on GPUs using the customized sparse storage format
AU - Shi, Shaohuai
AU - Wang, Qiang
AU - Chu, Xiaowen
N1 - Funding Information:
The research was supported in part by a Hong Kong RGC GRF grant under contract HKBU 12200418.
PY - 2020/12
Y1 - 2020/12
N2 - Multiplication of a sparse matrix by a dense matrix (SpDM) is widely used in many areas such as scientific computing and machine learning. However, existing work overlooks the performance optimization of SpDM on modern manycore architectures like GPUs. Sparse storage formats allow sparse matrices to be stored compactly, but they complicate the optimization of SpDM on modern GPUs: the irregular data access patterns of sparse structures result in lower resource utilization and poorer performance. In this paper, we use the roofline performance model of GPUs to design an efficient SpDM algorithm called GCOOSpDM, which exploits coalesced global memory access, fast shared memory reuse, and more operations per byte of global memory traffic. We evaluate the algorithm on three Nvidia GPUs (i.e., GTX 980, GTX Titan X Pascal, and Tesla P100) using a large number of matrices, including a public dataset and randomly generated matrices. Experimental results show that GCOOSpDM achieves a 1.5-8x speedup over Nvidia's cuSPARSE library on many matrices.
AB - Multiplication of a sparse matrix by a dense matrix (SpDM) is widely used in many areas such as scientific computing and machine learning. However, existing work overlooks the performance optimization of SpDM on modern manycore architectures like GPUs. Sparse storage formats allow sparse matrices to be stored compactly, but they complicate the optimization of SpDM on modern GPUs: the irregular data access patterns of sparse structures result in lower resource utilization and poorer performance. In this paper, we use the roofline performance model of GPUs to design an efficient SpDM algorithm called GCOOSpDM, which exploits coalesced global memory access, fast shared memory reuse, and more operations per byte of global memory traffic. We evaluate the algorithm on three Nvidia GPUs (i.e., GTX 980, GTX Titan X Pascal, and Tesla P100) using a large number of matrices, including a public dataset and randomly generated matrices. Experimental results show that GCOOSpDM achieves a 1.5-8x speedup over Nvidia's cuSPARSE library on many matrices.
KW - COO
KW - GCOO
KW - GPU
KW - Sparse Matrix Multiplication
UR - http://www.scopus.com/inward/record.url?scp=85102368885&partnerID=8YFLogxK
U2 - 10.1109/ICPADS51040.2020.00013
DO - 10.1109/ICPADS51040.2020.00013
M3 - Conference contribution
AN - SCOPUS:85102368885
T3 - Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
SP - 19
EP - 26
BT - Proceedings - 2020 IEEE 26th International Conference on Parallel and Distributed Systems, ICPADS 2020
PB - IEEE Computer Society
T2 - 26th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2020
Y2 - 2 December 2020 through 4 December 2020
ER -