Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (10): 75-85. DOI: 10.3778/j.issn.1002-8331.2208-0017

• Theory and R&D •

Optimization of OpenMP Offload Shared Memory Access for Domestic Heterogeneous Platforms

WANG Xin, LI Jianan, HAN Lin, ZHAO Rongcai, ZHOU Qiangwei

  1. School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
  2. National Supercomputing Center in Zhengzhou (Zhengzhou University), Zhengzhou 450001, China
  • Online: 2023-05-15  Published: 2023-05-15

Abstract: Local data share (LDS) on the domestic heterogeneous processor DCU (deep computing unit) is a low-latency, high-bandwidth, explicitly addressed memory. The OpenMP implementation on domestic heterogeneous systems provides no programming interface for LDS access, so the LDS hardware is not used effectively for efficient data access. To address this problem, the OpenMP Offload execution modes on the DCU platform, the LDS allocation method, and the instruction structure specific to LDS access are studied, and manual support for LDS access is implemented. In addition, building on this optimization method, LDS access is automated for the different OpenMP Offload execution modes, forming an efficient memory access strategy for domestic heterogeneous platforms. Experiments use the PolyBench benchmark suite. The manual and automatic optimizations reach an average speedup of up to 2.60 in single-threaded mode, the manual optimization reaches an average speedup of 1.38 in multi-threaded non-SPMD mode, and the automatic optimization reaches an average speedup of 1.11 in multi-threaded SPMD mode. The results show that both automatic and manual support for LDS access help speed up OpenMP heterogeneous programs.

Key words: domestic processor DCU, local data share (LDS), OpenMP Offload, SPMD, non-SPMD