Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (1): 133-140.

• Pattern Recognition and Artificial Intelligence •

Naive Bayes Algorithm Based on Data Imputation and Continuous Attributes

李忠波,杨建华,刘文琦   

  1. School of Control Science and Engineering, Dalian University of Technology, Dalian, Liaoning 116024
  • Publication date: 2016-01-01 Release date: 2015-12-30

Naive Bayes based on data filling and continuous attribute

LI Zhongbo, YANG Jianhua, LIU Wenqi   

  1. School of Control Science and Engineering, Dalian University of Technology, Dalian, Liaoning 116024, China
  • Published: 2016-01-01 Online: 2015-12-30

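Abstract: When handling classification problems, the Naive Bayes (NB) algorithm usually assumes that the numerical continuous attributes of the training samples are normally distributed, and its classification accuracy also depends on the completeness of the training data; real sampled data rarely satisfy either requirement. To address missing data, the Naive Bayes classifier learns its parameters from the available incomplete data using the expectation-maximization (EM) algorithm. To address continuous attributes that are not normally distributed, the distribution density obtained by kernel density estimation, together with a new computation method, is used to find the maximum posterior probability. Classification experiments on standard data sets verify the effectiveness of the improvements. The improved algorithm, EM-DNB, is applied to predicting protein purification processes in bioengineering, and the experimental results show that prediction accuracy improves.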

Keywords: Naive Bayes (NB), expectation-maximization (EM) algorithm, continuous attributes, kernel density estimation, protein purification

Abstract: When dealing with classification problems, Naive Bayes (NB) usually assumes that numerical continuous attributes follow a normal distribution, and its classification accuracy is also affected by the completeness of the training data. Data sampled in practice rarely meet these requirements. For missing data, the Naive Bayes classifier learns its parameters from the existing incomplete data using the Expectation-Maximization (EM) algorithm. For non-normal numerical continuous attributes, the distribution density obtained by kernel density estimation and a new calculation method are used to compute the maximum posterior probability. Classification experiments on standard data sets verify the effectiveness of the improvement. Finally, the improved algorithm (EM-DNB) is applied to the prediction of protein purification processes in biological engineering. The experimental results show that the prediction accuracy is improved.
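
To make the kernel-density idea concrete, the following is a minimal sketch of a Naive Bayes classifier whose class-conditional densities come from a Gaussian kernel density estimate instead of a fitted normal distribution. It is an illustration only, not the paper's EM-DNB implementation: the class name `KDENaiveBayes`, the helper functions, the Gaussian kernel, and the Silverman bandwidth rule are all assumptions introduced here.

```python
import numpy as np


def silverman_bandwidth(samples):
    """Rule-of-thumb bandwidth (an assumed choice, not taken from the paper)."""
    sigma = samples.std(ddof=1)
    return max(1.06 * sigma * len(samples) ** (-0.2), 1e-3)


def kde_log_density(x, samples, bandwidth):
    """Log of a Gaussian kernel density estimate evaluated at scalar x."""
    z = (x - samples) / bandwidth
    log_kernels = -0.5 * z ** 2 - np.log(bandwidth * np.sqrt(2.0 * np.pi))
    # Log-mean-exp keeps the result finite when x is far from every sample.
    m = log_kernels.max()
    return m + np.log(np.mean(np.exp(log_kernels - m)))


class KDENaiveBayes:
    """Hypothetical Naive Bayes with one KDE per (class, attribute) pair."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        self.log_prior_ = []
        self.columns_ = []
        for c in self.classes_:
            Xc = X[y == c]
            self.log_prior_.append(np.log(len(Xc) / len(X)))
            # Store the training values and a bandwidth for each attribute.
            self.columns_.append(
                [(Xc[:, j], silverman_bandwidth(Xc[:, j]))
                 for j in range(X.shape[1])]
            )
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        predictions = []
        for row in X:
            scores = []
            for log_prior, columns in zip(self.log_prior_, self.columns_):
                # Naive Bayes: log prior plus a sum of per-attribute log densities.
                score = log_prior
                for value, (samples, h) in zip(row, columns):
                    score += kde_log_density(value, samples, h)
                scores.append(score)
            predictions.append(self.classes_[int(np.argmax(scores))])
        return np.array(predictions)
```

A bimodal attribute is the typical case where this differs from the Gaussian assumption: a single normal fitted to a class with two separated modes puts its peak between them, while the kernel estimate keeps most of its mass near the modes.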

Key words: Naive Bayes (NB), Expectation-Maximization (EM) algorithm, continuous attributes, kernel density estimation, protein purification
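
The EM-based parameter learning from incomplete data mentioned in the abstract can be illustrated with a small sketch. It is not the paper's procedure: it assumes fully observed class labels, attribute values missing at random and encoded as NaN, and a Gaussian model per class and attribute. The function name `em_gaussian_nb_params` is hypothetical.

```python
import numpy as np


def em_gaussian_nb_params(X, y, n_iter=50):
    """EM estimates of per-class attribute means and variances with NaNs in X.

    Assumptions (not from the paper): labels are complete, values are
    missing at random, and each attribute is Gaussian given the class.
    Returns a dict mapping class label -> (mean vector, variance vector).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    params = {}
    for c in np.unique(y).tolist():
        Xc = X[y == c]
        missing = np.isnan(Xc)
        # Deliberately crude start so the iterations do real work.
        mu = np.zeros(Xc.shape[1])
        var = np.ones(Xc.shape[1])
        for _ in range(n_iter):
            # E-step: expected sufficient statistics of each missing entry
            # under the current parameters, E[x] = mu and E[x^2] = mu^2 + var.
            ex = np.where(missing, mu, Xc)
            ex2 = np.where(missing, mu ** 2 + var, Xc ** 2)
            # M-step: maximum-likelihood update from the completed statistics.
            mu = ex.mean(axis=0)
            var = np.maximum(ex2.mean(axis=0) - mu ** 2, 1e-9)
        params[c] = (mu, var)
    return params
```

Under these assumptions the iteration converges to the mean and variance of the observed entries of each attribute within each class, so no incomplete row has to be discarded before training.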