Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (15): 72-92. DOI: 10.3778/j.issn.1002-8331.2409-0342

• Hot Topics and Reviews •

Review of Server Memory Reliability Technology

LI Daotong (李道童), LI Shengxin (李盛新), WANG Bing (王兵), YAO Fanyi (姚藩益), LU Fei (芦飞), AI Shanbin (艾山彬), ZHANG Binghui (张炳会), SUN Xiu-qiang (孙秀强), WANG Ruolin (王若琳)

  1. Hardware R&D Department 1, Inspur Electronic Information Industry Co., Ltd., Jinan 250000, China
  2. School of Computer, Heze University, Heze, Shandong 274000, China
  • Online: 2025-08-01 Published: 2025-07-31

Abstract: Memory is a core component of servers. As memory technology continues to iterate and its performance improves markedly, memory reliability has become a key factor in overall server stability. This paper reviews the evolution of memory technology, its structural characteristics, and the direct impact of its development on server performance, and analyzes the diversity and complexity of memory failure modes. It then examines recent advances in fault detection and handling, emphasizing the role of memory error correction codes (ECC) and memory RAS (reliability, availability, serviceability) fault-tolerance technologies, and surveys work on predicting at-risk memory cells, in particular failure prediction methods that incorporate deterministic rules or machine learning algorithms. On this basis, the paper systematically analyzes the core challenges currently facing memory reliability and outlines future research directions, including accurate prediction of memory aging, real-time monitoring of memory health, and deeper application of machine learning to predictive analysis. It concludes that the stability and reliability of server memory must be improved in parallel with its performance to meet growing demands on servers, and it aims to provide practical guidance and theoretical reference for the future development of memory reliability technology.

Key words: server memory, fault tolerance, fault detection, health monitoring, machine learning