
Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (15): 72-92. DOI: 10.3778/j.issn.1002-8331.2409-0342
LI Daotong, LI Shengxin, WANG Bing, YAO Fanyi, LU Fei, AI Shanbin, ZHANG Binghui, SUN Xiu-qiang, WANG Ruolin
Online: 2025-08-01
Published: 2025-07-31
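Abstract: Memory is a core server component, and as memory technology continues to iterate and its performance improves markedly, memory reliability has become a key factor in overall server stability that cannot be ignored. This paper reviews the evolution of memory technology, its structural characteristics, and the direct impact of its development on server performance, and analyzes the diversity and underlying complexity of memory failure modes. It then examines recent advances in fault detection and handling, emphasizing the role of memory error-correcting codes (ECC) and memory fault-tolerance techniques, and focuses on frontier work in predicting at-risk memory units, in particular memory failure prediction methods that incorporate deterministic rules or machine learning algorithms. On this basis, it systematically analyzes the core challenges currently facing the memory reliability field and outlines future research directions, including accurate prediction of memory aging, real-time health monitoring, and deeper application of machine learning to predictive analysis. Finally, it stresses that the pursuit of maximum server memory performance must be accompanied by parallel improvements in stability and reliability to meet growing server performance demands, and it offers practical guidance and a theoretical reference for the future development of memory reliability technology.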
LI Daotong, LI Shengxin, WANG Bing, YAO Fanyi, LU Fei, AI Shanbin, ZHANG Binghui, SUN Xiu-qiang, WANG Ruolin. Review of Server Memory Reliability Technology[J]. Computer Engineering and Applications, 2025, 61(15): 72-92.