RMerge2

Memory-efficient sparse matrix-matrix multiplication by row merging on many-core architectures. Sparse matrix-matrix multiplication (SpMM) is used for many computational tasks including algebraic multigrid solvers and graph operations. Our aim is to provide a fast SpMM implementation (RMerge2) for many-core architectures such as graphics processing units (GPUs) which operates with predictable and low memory overhead. RMerge2 breaks the sparse matrix-matrix product into many sparse vector-matrix products and computes these by parallel row merging. It merges up to 512 rows using warps of up to 32 threads, where each thread maintains states for multiple rows in registers. Larger numbers of rows (up to $32K$, where $K = 1024$) are merged by a whole thread block using shared memory. The rows of the left-hand side are grouped into different cases based on their number of nonzeros and are processed with specific kernels implemented using C++ templates. This approach computes SpMM with memory overhead of only 5 bytes per row for left-hand sides with up to $32K$ nonzeros per row. Latencies of small but long kernels are hidden by concurrent kernel execution based on streams. Performance measurements show that merging more than 1 row per thread improves performance for homogeneous matrices by factors of up to 2.4. Case-based and concurrent kernels improve performance, particularly for heterogeneous matrices, by factors up to 1.6 and 3.7, respectively. Compared to a parallel CPU implementation, RMerge2 achieves a mean speedup of 4.5, and compared to three other GPU implementations, speedup factors of 11.3, 8.6, and 2.5 for matrix squaring and 7.4, 1.9, and 2.4 for Galerkin products are achieved. The Pascal GPU is faster than a Kepler GPU by an average factor of 3.6 for matrix squaring, i.e., higher than expected from the increase in nominal peak performance and memory bandwidth, suggesting that other improvements, including faster memory allocation, stream creation, and more warp shuffle operations, contribute to the overall performance increase.