[1] 闫昊. 面向申威1621国产多核处理器的稠密矩阵运算并行算法和性能优化技术研究[D]. 北京: 中国科学院大学, 2021.
YAN H. Research on parallel algorithm and performance optimization technology of dense matrix computing for Shenwei 1621 domestic multicore processor[D]. Beijing: University of Chinese Academy of Sciences, 2021.
[2] KING D E. Dlib-ml: a machine learning toolkit[J]. Journal of Machine Learning Research, 2009, 10(3): 1755-1758.
[3] DEMŠAR J, CURK T, ERJAVEC A, et al. Orange: data mining toolbox in Python[J]. Journal of Machine Learning Research, 2013, 14(1): 2349-2353.
[4] ABHYANKAR S, BETRIE G, MALDONADO D A, et al. PETSc DMNetwork: a library for scalable network PDE-based multiphysics simulations[J]. Transactions on Mathematical Software (TOMS), 2020, 46(1): 1-24.
[5] ABADI M, BARHAM P, CHEN J, et al. TensorFlow: a system for large-scale machine learning[C]//Proceedings of USENIX Conference on Operating Systems Design and Implementation (OSDI), 2016: 265-283.
[6] KARNIADAKIS G, SHERWIN S. Spectral/hp element methods for computational fluid dynamics[M]. Oxford: Oxford University Press, 2005.
[7] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[8] DHILLON I S, GUAN Y, KULIS B. Kernel k-means: spectral clustering and normalized cuts[C]//Proceedings of International Conference on Knowledge Discovery and Data Mining (KDD), 2004: 551-556.
[9] K-means by NVIDIA[EB/OL]. [2024-05-06]. https://github.com/NVIDIA/kmeans.
[10] LIU Y, LIU X, LI F, et al. Closing the “quantum supremacy” gap: achieving real-time simulation of a random quantum circuit using a new Sunway supercomputer[C]//Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2021: 1-12.
[11] SHANG H, LI F, ZHANG Y, et al. Extreme-scale ab initio quantum raman spectra simulations on the leadership HPC system in China[C]//Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2021.
[12] XIAO J, CHEN J, ZHENG J, et al. Symplectic structure-preserving particle-in-cell whole-volume simulation of tokamak plasmas to 111.3 trillion particles and 25.7 billion grids[C]//Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2021.
[13] GOTO K, VAN DE GEIJN R. High-performance implementation of the Level-3 BLAS[J]. Transactions on Mathematical Software (TOMS), 2008, 35(1): 1-14.
[14] WILLIAMS S, WATERMAN A, PATTERSON D. Roofline: an insightful visual performance model for floating-point programs and multicore architectures[J]. Communications of the ACM, 2009, 52(4): 65-76.
[15] AUER A, BAUMGARTNER G, BERNHOLDT D, et al. Automatic code generation for many-body electronic structure methods: the tensor contraction engine[J]. Molecular Physics, 2006, 104(2): 211-228.
[16] KHODAYARI A, ZOMORRODI A, LIAO J, et al. A kinetic model of Escherichia coli core metabolism satisfying multiple sets of mutant flux data[J]. Metabolic Engineering, 2014, 25: 50-62.
[17] HU Y, CHEN D K, YANG C, et al. Optimization techniques of parallel level 3 BLAS routines on domestic SW26010-Pro many-core processor[J]. Journal of Software, 2024, 35(3): 1569-1584.
[18] NERSC. Roofline performance model[EB/OL]. [2024-05-06]. https://docs.nersc.gov/tools/performance/roofline/.
[19] HAIDAR A, ABDELFATTAH A, ZOUNON M, et al. A guide for achieving high performance with very small matrices on GPU: a case study of batched LU and Cholesky factorizations[J]. Transactions on Parallel and Distributed Systems, 2018, 29(5): 973-984.
[20] FRISON G, KOUZOUPIS D, SARTOR T, et al. BLASFEO: basic linear algebra subroutines for embedded optimization[J]. Transactions on Mathematical Software, 2018, 44(4): 42.
[21] YANG W L, FANG J B, DONG D Z, et al. LIBSHALOM: optimizing small and irregular-shaped matrix multiplications on ARMv8 multi-cores[C]//Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2021: 1-14.
[22] RIVERA C, CHEN J, XIONG N, et al. TSM2X: high-performance tall-and-skinny matrix-matrix multiplication on GPUs[J]. Journal of Parallel and Distributed Computing, 2021, 151(3): 70-85.
[23] ERNST D, HAGER G, THIES J, et al. Performance engineering for real and complex tall & skinny matrix multiplication kernels on GPUs[J]. International Journal of High Performance Computing Applications (IJHPCA), 2021, 35(1): 5-19.
[24] WANG X, ZHOU Z, HU C, et al. Accelerating and tuning small matrix multiplications on Sunway TaihuLight: a case study of spectral element CFD code Nek5000[J]. The International Journal of High Performance Computing Applications (IJHPCA), 2020, 34(2): 178-186.
[25] HU Y, CHEN D K, YANG C, et al. Many-core optimization of level 1 and level 2 BLAS routines on SW26010-Pro[J]. Journal of Software, 2023, 34(9): 4421-4436.
[26] TAN G, LI L, TRIECHLE S, et al. Fast implementation of DGEMM on Fermi GPU[C]//Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2011: 1-11.
[27] HEINECKE A, HENRY G, HUTCHINSON M, et al. LIBXSMM: accelerating small matrix multiplications by runtime code generation[C]//Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2016: 981-991.
[28] GUNNELS J A, HENRY G M, VAN DE GEIJN R. A family of high-performance matrix multiplication algorithms[C]//Proceedings of International Conference on Computational Science (ICCS), 2001: 51-60.
[29] TECHOPEDIA. What does table-driven design mean[EB/OL]. [2024-05-06]. https://www.techopedia.com/definition/30408/table-driven-design.
[30] IBM. How leading dimension is used for matrices[EB/OL]. [2024-05-06]. https://www.ibm.com/docs/en/essl/6.3?topic=matrices-how-leading-dimension-is-used.
[31] Story of Mathematics. The dimension of a matrix[EB/OL]. [2024-05-06]. https://www.storyofmathematics.com/dimension-of-a-matrix/.
[32] QUAN T M, HILDEBRAND D G C, JEONG W K. FusionNet: a deep fully residual convolutional neural network for image segmentation in connectomics[J/OL]. arXiv preprint arXiv:1612.05360, 2016.
[33] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[C]//Proceedings of International Conference on Learning Representations (ICLR), 2015.
[34] LIU F, MA W, ZHAO Y, et al. xMath2.0: a high-performance extended math library for SW26010-Pro many-core processor[J]. CCF Transactions on High Performance Computing (THPC), 2023, 5: 56-71.