
Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (15): 72-92. DOI: 10.3778/j.issn.1002-8331.2409-0342
• Research Hotspots and Reviews •
LI Daotong, LI Shengxin, WANG Bing, YAO Fanyi, LU Fei, AI Shanbin, ZHANG Binghui, SUN Xiu-qiang, WANG Ruolin
Online: 2025-08-01
Published: 2025-07-31
李道童,李盛新,王兵,姚藩益,芦飞,艾山彬,张炳会,孙秀强,王若琳
LI Daotong, LI Shengxin, WANG Bing, YAO Fanyi, LU Fei, AI Shanbin, ZHANG Binghui, SUN Xiu-qiang, WANG Ruolin. Review of Server Memory Reliability Technology[J]. Computer Engineering and Applications, 2025, 61(15): 72-92.
李道童, 李盛新, 王兵, 姚藩益, 芦飞, 艾山彬, 张炳会, 孙秀强, 王若琳. 服务器内存可靠性技术研究综述[J]. 计算机工程与应用, 2025, 61(15): 72-92.
URL: http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.2409-0342
| [1] | WU Ruiqi, ZHOU Yi. Multi-Source Multi-Task Learning with Knowledge Integration for Fundus Disease Classification [J]. Computer Engineering and Applications, 2025, 61(7): 255-266. |
| [2] | WEI Jiamei, YUAN Shujuan, KONG Shanshan, YANG Aimin, ZHAO Chenying. Development and Application of Light Gradient Boosting Machine [J]. Computer Engineering and Applications, 2025, 61(5): 32-42. |
| [3] | ZHAO Chanchan, LYU Fei, SHI Bao, YU Xiaomin, YANG Xingchen, YUE Xiaocan. Review of Collaborative Inference Methods for Edge Intelligence [J]. Computer Engineering and Applications, 2025, 61(3): 1-20. |
| [4] | FANG Keyuan, XU Kewei. Complementary Strengths of LLMs and ML: Government Service Response Quality Detection and Explanation Algorithm Framework [J]. Computer Engineering and Applications, 2025, 61(16): 146-159. |
| [5] | WAN Jiling, CAO Lifeng, BAI Jinlong, LI Jinhui, DU Xuehui. Survey of Anomaly Detection Methods for Blockchain Networks [J]. Computer Engineering and Applications, 2025, 61(13): 78-99. |
| [6] | PENG Yanfei, GUO Jialong, HUANG Jin, ZHENG Hongwei, WANG Gengzhe. Methods for Mineable Cryptocurrency Identification Based on Network Traffic [J]. Computer Engineering and Applications, 2025, 61(13): 200-207. |
| [7] | HU Xiangkun, LI Hua, FENG Yixiong, QIAN Songrong, LI Jian, LI Shaobo. Research Advance of Crack Detection for Infrastructure Surfaces Based on Deep Learning [J]. Computer Engineering and Applications, 2025, 61(1): 1-23. |
| [8] | HUANG Shiyang, XI Xuefeng, CUI Zhiming. Research and Exploration on Chinese Natural Language Processing in Era of Large Language Models [J]. Computer Engineering and Applications, 2025, 61(1): 80-97. |
| [9] | PEI Wencan, SUN Guangwei, HUANG Jinguo, XU Dinghui, LIU Jing. Immediate Prediction Model of SPAD Value and Maturity of Fresh Tobacco Leaves in Field [J]. Computer Engineering and Applications, 2024, 60(8): 348-360. |
| [10] | XING Changzheng, XU Jiayu. Hybrid LightGBM Model for Breast Cancer Diagnosis [J]. Computer Engineering and Applications, 2024, 60(6): 330-338. |
| [11] | JIANG Lulu, GAO Jintao. Survey of Machine Learning for Database Parameter Tuning Techniques [J]. Computer Engineering and Applications, 2024, 60(3): 1-16. |
| [12] | WU Haitao, CAI Yongqi, GAO Jianhua. Bagging Heterogeneous Ensemble Code Smell Detection and Refactoring Priority Division [J]. Computer Engineering and Applications, 2024, 60(3): 138-147. |
| [13] | SONG Cheng, XIE Zhenping. Dataset Enhancement Quality Evaluation Method for Chinese Error Correction Task as Example [J]. Computer Engineering and Applications, 2024, 60(3): 331-339. |
| [14] | LONG Xiangfu, LI Shaobo, ZHANG Yizong, YANG Lei, LI Chuanjiang. Overview of Causal Learning Techniques and Applications [J]. Computer Engineering and Applications, 2024, 60(24): 1-19. |
| [15] | ZHENG Chengwei, WANG Haifeng, LIU Rui. Review of Research on DDoS Attack Detection in SDN [J]. Computer Engineering and Applications, 2024, 60(24): 79-96. |