Optimization of OpenMP Offload Shared Memory Access for Domestic Heterogeneous Platforms

doi:10.3778/j.issn.1002-8331.2208-0017

Abstract

Abstract: The local data share (LDS) on the domestic heterogeneous processor DCU (deep computing unit) is an explicitly addressed memory with low latency and high bandwidth. The OpenMP implementation for domestic heterogeneous systems provides no programming interface for LDS access, so the LDS hardware is not used effectively for data access. To address this problem, this paper studies the execution modes of OpenMP Offload on the DCU platform, the allocation method for LDS, and the instruction structure specific to LDS memory access, and implements manual support for LDS memory access. Building on this optimization method, LDS memory access is then automated for the different execution modes of OpenMP Offload, yielding an efficient memory access strategy for domestic heterogeneous platforms. Experiments on the PolyBench benchmark suite show that the average speedup reaches 2.60 in single-threaded mode with both the manual and automatic optimizations, 1.38 in multi-threaded non-SPMD mode with the manual optimization, and 1.11 in multi-threaded SPMD mode with the automatic optimization. The results show that automatic and manual support for LDS memory access helps improve the running speed of heterogeneous OpenMP programs.

Keywords: domestic processor DCU, local data share (LDS), OpenMP Offload, SPMD, non-SPMD
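
The abstract distinguishes the SPMD and non-SPMD execution modes of OpenMP Offload. As a rough illustration of the two loop shapes, the sketch below writes the same matrix-vector product both ways using only standard OpenMP directives. The function names, the problem size N, and the test data are assumptions made for this example and are not taken from the paper. Whether a compiler actually lowers the second variant to non-SPMD (generic) mode depends on the toolchain, and the LDS placement studied in the paper is not shown, because the abstract does not describe that interface.

```c
#include <assert.h>

#define N 64  /* hypothetical problem size, chosen for illustration */

/* SPMD-style kernel: one combined construct, every device thread
   executes the same loop body on its own iterations. */
void matvec_spmd(const float *A, const float *x, float *y) {
    #pragma omp target teams distribute parallel for \
        map(to: A[0:N*N], x[0:N]) map(from: y[0:N])
    for (int i = 0; i < N; i++) {
        float sum = 0.0f;
        for (int j = 0; j < N; j++)
            sum += A[i * N + j] * x[j];
        y[i] = sum;
    }
}

/* Non-SPMD-style kernel: each team runs sequential code around a
   nested parallel region, so worker threads wait for work. */
void matvec_generic(const float *A, const float *x, float *y) {
    #pragma omp target teams distribute \
        map(to: A[0:N*N], x[0:N]) map(from: y[0:N])
    for (int i = 0; i < N; i++) {
        float sum = 0.0f;  /* set up outside the parallel region */
        #pragma omp parallel for reduction(+:sum)
        for (int j = 0; j < N; j++)
            sum += A[i * N + j] * x[j];
        y[i] = sum;        /* written back outside the parallel region */
    }
}

/* Fill A with small integers and x with ones, so every row sum is
   exactly representable in float regardless of summation order. */
void fill_inputs(float *A, float *x) {
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        for (int j = 0; j < N; j++)
            A[i * N + j] = (float)((i + j) % 3);
    }
}

/* Host-side check that both variants compute the same result. */
void check_modes_agree(void) {
    static float A[N * N], x[N], y_spmd[N], y_gen[N];
    fill_inputs(A, x);
    matvec_spmd(A, x, y_spmd);
    matvec_generic(A, x, y_gen);
    for (int i = 0; i < N; i++)
        assert(y_spmd[i] == y_gen[i]);
}
```

Built without OpenMP support, the pragmas are ignored and both functions run sequentially on the host; with an offloading-capable compiler the same source targets the device. Calling check_modes_agree() is a quick way to confirm that the two variants agree before comparing their running times.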

WANG Xin, LI Jianan, HAN Lin, ZHAO Rongcai, ZHOU Qiangwei. Optimization of OpenMP Offload Shared Memory Access for Domestic Heterogeneous Platforms[J]. Computer Engineering and Applications, 2023, 59(10): 75-85.
