Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (6): 150-163. DOI: 10.3778/j.issn.1002-8331.2405-0142

• Theory, Research and Development •

Performance Optimization Techniques of Irregular-Shaped Matrix Multiplication on SW26010P

HU Yi, CHEN Daokun, YANG Chao   

  1. School of Mathematical Sciences, Peking University, Beijing 100871, China
  2. Research Center of Advanced Computing, Changsha Institute for Computing and Digital Economy, Peking University, Changsha 410205, China
  3. Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
  • Online: 2025-03-15 Published: 2025-03-14

面向SW26010P的异形矩阵乘法众核并行优化技术研究

胡怡,陈道琨,杨超   

  1. 北京大学 数学科学学院,北京 100871
  2. 北京大学 长沙计算与数字经济研究院 先进计算研究中心,长沙 410205
  3. 中国科学院 软件研究所 并行软件与计算科学实验室,北京 100190

Abstract: Matrix multiplication is widely used in scientific and engineering computing and is a key optimization target in Basic Linear Algebra Subprograms (BLAS) libraries. With the rapid development of artificial neural networks, computational fluid dynamics, and other fields, irregular-shaped matrix multiplication is gaining attention. This paper studies many-core parallel optimization techniques for irregular-shaped matrix multiplication on the SW26010P, the domestic many-core processor deployed in the new-generation Sunway supercomputer. Specifically, based on the hardware characteristics of the SW26010P and the data layout of irregular-shaped matrices, a parallel algorithm with diversified task partition mapping is designed to improve the bandwidth utilization of direct memory access (DMA). Based on the hardware pipelines and the vectorized memory-access and computation instructions of the SW26010P, the types of computation involved are abstracted and optimized at the assembly level to improve computational efficiency. A data-sharing strategy under the remote memory access (RMA) point-to-point communication mechanism is proposed to reduce the overhead of data access and transfer, and a nested double-buffering technique is proposed to further improve performance. In addition, to address the problem of choosing blocking parameters for the parallel implementation of different kinds of irregular-shaped matrix multiplication, experiments on the SW26010P are conducted to determine the optimal blocking parameters for each function. The experimental results show that the optimized irregular-shaped matrix multiplication reaches up to 93% of the performance upper bound predicted by the roofline model. Compared with a conventional large-scale general matrix multiplication (GEMM) implementation, it achieves an average speedup of 5.43 times and a maximum speedup of 51.5 times.

Key words: irregular-shaped matrix multiplication, SW26010P many-core processor, diversified task partition mapping, RMA point-to-point mechanism, nested double-buffering technique
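
The performance upper bound cited in the abstract is a roofline-model bound. As a rough illustration of why irregular shapes behave differently from large square GEMM, the sketch below computes arithmetic intensity and the resulting roofline bound for two shapes. The peak rate, the bandwidth, and both matrix shapes are hypothetical values chosen for illustration; they are not SW26010P specifications and are not taken from the paper. The traffic model (each matrix crosses the memory interface exactly once) is also an idealizing assumption.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=8):
    """FLOPs per byte for C (m x n) = A (m x k) @ B (k x n).

    Assumes each of A, B, and C moves through main memory exactly once,
    which is an idealization used only for this sketch.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def roofline_bound(intensity, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, intensity * bandwidth_gbs)


# Hypothetical machine parameters, NOT SW26010P specifications.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 50.0

# Illustrative shapes (m, n, k); not the test cases used in the paper.
shapes = {
    "square": (4096, 4096, 4096),
    "tall-skinny": (65536, 16, 16),
}

bounds = {
    name: roofline_bound(gemm_arithmetic_intensity(*shape), PEAK_GFLOPS, BANDWIDTH_GBS)
    for name, shape in shapes.items()
}

for name, bound in bounds.items():
    kind = "compute-bound" if bound == PEAK_GFLOPS else "memory-bound"
    print(f"{name}: at most {bound:.1f} GFLOP/s ({kind})")
```

Under these assumed parameters the square case hits the compute roof, while the tall-skinny case is limited by memory bandwidth. This is consistent with the abstract's emphasis on DMA bandwidth utilization, data sharing, and double buffering for irregular shapes.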

摘要: 矩阵乘法广泛应用于科学与工程计算领域,是基础线性代数库中的关键优化对象。随着人工神经网络、计算流体力学等领域的快速发展,异形(irregular-shaped)矩阵乘法正在迅速引起关注。研究集中在针对国产新一代神威超级计算机采用的SW26010P众核处理器,探讨异形矩阵乘法的众核并行优化技术。具体而言,结合SW26010P的硬件特性和异形矩阵的数据布局,设计了多样化任务划分映射的并行算法,提高直接内存访问(direct memory access,DMA)访存带宽利用率。结合SW26010P的硬件流水线和向量化访存/计算指令,抽象运算中涉及的计算类型进行底层汇编优化,提高了计算效率。提出了远程内存访问(remote memory access,RMA)点对点机制下的数据共享策略,降低数据访存和传输开销,并提出了嵌套双缓冲技术进一步提高异形矩阵乘法的性能。此外,针对不同种类异形矩阵乘法行实现时面临的分块参数适配问题,基于SW26010P众核处理器进行实验分析研究,确定了各函数并行化时的最优分块参数。实验结果显著,所优化的异形矩阵乘法的性能最高可达roofline模型预测性能上限的93%,相较于常规大规模矩阵乘法算法平均获得了5.43倍的性能加速,最高可获得51.5倍的性能加速。

关键词: 异形矩阵乘法, SW26010P众核处理器, 多样化任务划分映射, RMA点对点机制, 嵌套双缓冲技术