Performance Optimization Techniques of Irregular-Shaped Matrix Multiplication on SW26010P

doi:10.3778/j.issn.1002-8331.2405-0142

Abstract

Abstract: Matrix multiplication is widely used in scientific and engineering computing and is the most important optimization target in BLAS. With the development of artificial neural networks, computational fluid dynamics, and other fields, irregular-shaped matrix multiplication is rapidly gaining attention. This paper proposes many-core parallel optimization techniques for irregular-shaped matrix multiplication on SW26010P, the domestic many-core processor deployed in the new-generation Sunway supercomputer. Specifically, based on the hardware characteristics and the data layout of the matrix elements, a parallel algorithm with diversified task partition mapping is designed to improve memory access bandwidth utilization. Based on the hardware pipelines and the vectorized computation and memory access instructions, the key computations are abstracted and optimized at the assembly level to improve computational efficiency. A data-sharing strategy under the RMA point-to-point communication mechanism is adopted to reduce the overhead of data access and transmission, and nested double buffering is used to further improve performance. In addition, a series of experiments on SW26010P determines the optimal blocking parameters for the parallel implementation of each kind of function, so as to make full use of the hardware. The experimental results show that the optimized irregular-shaped matrix multiplication reaches up to 93% of the theoretical performance upper bound. Compared with the conventional large-scale GEMM implementation, it achieves an average speedup of 5.43 times and a maximum speedup of 51.5 times.

Key words: irregular-shaped matrix multiplication, Sunway 26010P many-core processor, diversified task partition mapping, RMA point-to-point mechanism, nested double buffering technique
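
The double buffering named in the abstract can be illustrated with a minimal sketch. This is ordinary (not nested) double buffering and is not the authors' implementation: the helpers `load_panel`, `compute_panel`, and `gemm_double_buffered` are hypothetical names, and the asynchronous DMA transfer of real hardware is replaced by a synchronous copy, so only the ping-pong buffer indexing is shown.

```python
def load_panel(A, B, k0, kb, buf):
    """Stand-in for an asynchronous DMA get (assumption): copy one k-panel
    of A and B into a local buffer."""
    k1 = min(k0 + kb, len(B))
    buf["a"] = [row[k0:k1] for row in A]          # m x kb panel of A
    buf["b"] = [B[p][:] for p in range(k0, k1)]   # kb x n panel of B


def compute_panel(C, buf):
    """Accumulate the product of the buffered panels into C."""
    for i, a_row in enumerate(buf["a"]):
        for p, a_ip in enumerate(a_row):
            b_row = buf["b"][p]
            for j in range(len(b_row)):
                C[i][j] += a_ip * b_row[j]


def gemm_double_buffered(A, B, kb=2):
    """C = A * B, processed in k-panels of width kb with two alternating buffers."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    bufs = [{}, {}]
    starts = list(range(0, k, kb))
    load_panel(A, B, starts[0], kb, bufs[0])      # prologue: fill buffer 0
    for t in range(len(starts)):
        cur, nxt = t % 2, (t + 1) % 2
        if t + 1 < len(starts):
            # On real hardware this load would be issued asynchronously and
            # overlap with the computation on the current buffer.
            load_panel(A, B, starts[t + 1], kb, bufs[nxt])
        compute_panel(C, bufs[cur])
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 0], [0, 1], [1, 1]]
print(gemm_double_buffered(A, B, kb=2))
```

While one buffer is being consumed by the computation, the other is being filled with the next panel, which is what allows data movement to hide behind arithmetic when the load is truly asynchronous.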

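Abstract (translated from the Chinese): Matrix multiplication is widely used in scientific and engineering computing and is a key optimization target in basic linear algebra libraries. With the rapid development of fields such as artificial neural networks and computational fluid dynamics, irregular-shaped matrix multiplication is rapidly attracting attention. This work studies many-core parallel optimization techniques for irregular-shaped matrix multiplication on the SW26010P many-core processor used in the domestic new-generation Sunway supercomputer. Specifically, combining the hardware characteristics of SW26010P with the data layout of irregular-shaped matrices, a parallel algorithm with diversified task partition mapping is designed to improve direct memory access (DMA) bandwidth utilization. Combining the hardware pipelines of SW26010P with its vectorized memory access and computation instructions, the types of computation involved are abstracted and optimized at the assembly level, improving computational efficiency. A data-sharing strategy under the remote memory access (RMA) point-to-point mechanism is proposed to reduce data access and transmission overhead, and a nested double buffering technique is proposed to further improve the performance of irregular-shaped matrix multiplication. In addition, to address the adaptation of blocking parameters that arises when different kinds of irregular-shaped matrix multiplication are implemented in parallel, experimental analysis on the SW26010P many-core processor determines the optimal blocking parameters for the parallelization of each function. The experiments show that the optimized irregular-shaped matrix multiplication reaches up to 93% of the performance upper bound predicted by the roofline model, with an average speedup of 5.43 times and a maximum speedup of 51.5 times over a conventional large-scale matrix multiplication algorithm.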

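Key words (translated from the Chinese): irregular-shaped matrix multiplication, SW26010P many-core processor, diversified task partition mapping, RMA point-to-point mechanism, nested double buffering technique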
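
The abstract reports performance relative to a roofline upper bound. As a rough illustration of why matrix shape matters for that bound, the sketch below computes the arithmetic intensity of a double-precision GEMM and the resulting roofline limit. The peak throughput and memory bandwidth are hypothetical placeholders, not SW26010P specifications, and the traffic model (each matrix moved through main memory once, C read and written) is a simplifying assumption.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=8):
    """FLOPs per byte for C = A * B + C with A (m x k), B (k x n), C (m x n),
    assuming A and B are read once and C is read and written once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + 2 * m * n)
    return flops / bytes_moved


def roofline_bound(intensity, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, intensity * bandwidth_gbs)


# Hypothetical machine parameters, chosen only for illustration.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 50.0

for shape in [(4096, 4096, 4096), (65536, 16, 16), (16, 16, 65536)]:
    ai = gemm_arithmetic_intensity(*shape)
    bound = roofline_bound(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
    print(shape, f"intensity={ai:.2f} FLOP/byte", f"bound={bound:.1f} GFLOP/s")
```

Under these assumptions a large square multiplication is compute-bound, while a tall-and-skinny one with the same machine parameters is limited by memory bandwidth, which is why bandwidth utilization is a central concern for irregular shapes.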

HU Yi, CHEN Daokun, YANG Chao. Performance Optimization Techniques of Irregular-Shaped Matrix Multiplication on SW26010P[J]. Computer Engineering and Applications, 2025, 61(6): 150-163.

胡怡, 陈道琨, 杨超. 面向SW26010P的异形矩阵乘法众核并行优化技术研究[J]. 计算机工程与应用, 2025, 61(6): 150-163.
