Research on Performance Measurement and Analysis Tools for Parallel Programs Based on Sampling

doi:10.3778/j.issn.1002-8331.2307-0016

Abstract

In practice, the performance of parallel programs often falls well short of both the theoretical peak and expectations. Tuning programs with performance analysis tools is an efficient way to close this gap. However, programmers and developers often find it hard to choose a tool, and find the tools complex to configure and use. Studying sampling-based performance analysis tools for parallel programs helps address these problems. Compared with instrumentation, performance tools based on asynchronous sampling can better control both the measurement overhead and the size of the measurement data. This paper focuses on three representative sampling-based performance analysis tools, VTune Profiler, HPCToolkit and Nsight Systems, and analyzes their principles and functions. Using real applications such as VASP, it explores and compares in detail the tools' software and hardware analysis capabilities and their parallel programming analysis capabilities. Because the tools differ in applicability and analysis effect across application scenarios, the paper proposes a scheme that combines multiple tools for performance analysis, providing a useful reference for developers and programmers.

Key words: performance analysis tools, asynchronous sampling, hardware performance counters, parallel programs, program tuning

HU Jiarui, SHI Jingyan, GUO Chaoqi. Research on Performance Measurement and Analysis Tools for Parallel Programs Based on Sampling[J]. Computer Engineering and Applications, 2024, 60(21): 286-296.
