(2023-11-12) DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication
| Author: Yuechen Lu; Weifeng Liu |
|---|
| Journal: , 1-14, 2023. |
| Local Link: Lu and Liu - 2023 - DASP Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multi.pdf |
| DOI: 10.1145/3581784.3607051 |
| Abstract: Sparse matrix-vector multiplication (SpMV) plays a key role in computational science and engineering, graph processing, and machine learning applications. Much work on SpMV was devoted to resolving problems such as random access to the vector x and unbalanced load. However, we have experimentally found that the computation of inner products still occupies much overhead in the SpMV operation, which has been largely ignored in existing work. In this paper, we propose DASP, a new algorithm using specific dense MMA units for accelerating the compute part of general SpMV. We analyze the row-wise distribution of nonzeros and group the rows into three categories containing long, medium, and short rows, respectively. We then organize them into small blocks of proper sizes to meet the requirement of MMA computation. For the three categories, DASP offers different strategies to complete SpMV by efficiently utilizing the MMA units. |
| Note Date: 2025/7/22 21:45:09 |
Research Core
Content
Research Background (ABSTRACT)
Sparse matrix-vector multiplication (SpMV) is one of the core operations in computational science and engineering, graph processing, and machine learning applications.
Existing SpMV research mainly focuses on the following two aspects:
- Solving the random access problem of vector x
- Balancing the workload by restructuring the matrix into approximately uniform-sized basic work units
Although these existing approaches show promising improvements, the achieved performance remains unsatisfactory: bandwidth utilization is still far from peak, which indicates there is room for further SpMV algorithm optimization.
Key Finding: The authors decomposed the execution time of standard CSR SpMV into three parts for analysis:
- RANDOM ACCESS: averages 25.1%
- COMPUTE: averages 21.1%
- MISCELLANEOUS: averages 53.8%
The experiments show that the compute part still accounts for significant overhead, and that this overhead has been largely ignored by existing work.
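For reference, a minimal scalar CSR SpMV (my own sketch, not the authors' code) makes the measured parts visible: the gather from `x` is the random-access cost, and the multiply-accumulate loop is the compute part that DASP offloads to MMA units:

```python
import numpy as np

def csr_spmv(row_ptr, col_idx, vals, x):
    """Reference CSR SpMV computing y = A @ x.

    Each row's result is an inner product between the row's nonzeros
    and gathered entries of x -- the "compute" part DASP targets.
    """
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # gather x[col_idx[k]] (random access) + multiply-accumulate (compute)
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y

# 3x3 example: [[1,0,2],[0,3,0],[4,5,6]] @ [1,1,1]
row_ptr = [0, 2, 3, 6]
col_idx = [0, 2, 1, 0, 1, 2]
vals    = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(csr_spmv(row_ptr, col_idx, vals, np.ones(3)))  # → [ 3.  3. 15.]
```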
Research Method
Core Idea: Propose the DASP algorithm, using specific dense MMA units to accelerate the computation part of general SpMV. The core challenge is how to convert the irregular data distribution of sparse matrices into the regular layout required by MMA units.
Classification Strategy: All rows are divided into three categories based on the number of non-zero elements per row:
- Long Rows: Number of non-zero elements > MAX_LEN (256)
- Medium Rows: 4 < Number of non-zero elements ≤ MAX_LEN
- Short Rows: Number of non-zero elements ≤ 4
Targeted Processing: The DASP algorithm uses different strategies for three different row categories:
- Long rows: Split into fixed-size groups, padded with zeros to meet MMA computation requirements
- Medium rows: Distinguish between regular and irregular parts, using different computational units for each
- Short rows: Adopt a concatenation strategy, combining short rows into regular layouts to improve MMA unit utilization
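The long-row strategy can be illustrated with a small sketch: split one row's nonzeros into fixed-size groups and zero-pad the tail so every group matches the MMA tile shape. This is my own illustration under that assumption, with `group_size` standing in for MAX_LEN:

```python
def split_and_pad(row_vals, group_size=256):
    """Split one long row's nonzero values into fixed-size groups,
    zero-padding the last group so all groups have length group_size."""
    groups = []
    for start in range(0, len(row_vals), group_size):
        g = row_vals[start:start + group_size]
        g = g + [0.0] * (group_size - len(g))  # padding keeps the layout regular
        groups.append(g)
    return groups

groups = split_and_pad([1.0] * 300, group_size=256)
print(len(groups), len(groups[0]), len(groups[1]))  # → 2 256 256
print(sum(groups[1]))  # → 44.0 (300 - 256 real values; the rest are zeros)
```

Padding wastes a few multiply-accumulates on zeros, but in exchange every group presents the dense, fixed-shape operand that MMA units require.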
Research Results
- FP64 Double Precision: an average speedup of 1.5-2x over existing algorithms, with peak speedups exceeding 90x
- FP16 Half Precision: on average 1.7x faster than cuSPARSE on the A100, and 1.75x faster on the H800
Broad Matrix Applicability: Tested on 2893 matrices from the SuiteSparse matrix collection, DASP performed well on the vast majority of them, demonstrating the algorithm's generality and robustness.
Innovations
- Identified the compute part as a significant performance bottleneck of SpMV and used dense MMA units to accelerate it
- Proposed the DASP algorithm to make the irregular data distribution of sparse matrices regular for efficient utilization of MMA units
- Designed data layouts that can fully utilize the computational patterns of MMA units
- Demonstrated that on Ampere and Hopper architecture GPUs, DASP outperforms state-of-the-art SpMV algorithms on most matrices
Shortcomings
- Preprocessing Overhead: The data structure conversion requires additional preprocessing time. In my tests, for matrix sizes below 10^6 DASP's preprocessing is faster than cuSPARSE's, but beyond 10^6 DASP's preprocessing time grows roughly linearly while cuSPARSE's stays nearly constant. Since the two kernels' compute speeds are close at that scale, DASP's overall (preprocessing + compute) time is significantly worse than cuSPARSE's for matrices larger than 10^6.
- Increased Memory Consumption: The new data structure may consume more memory than the original CSR format, partly due to zero padding.
- Parameter Tuning Complexity: Threshold parameters for different categories (such as MAX_LEN, threshold) need to be tuned for different hardware and matrix characteristics.