Naive Bayes based on data filling and continuous attribute

Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (1): 133-140.

LI Zhongbo, YANG Jianhua, LIU Wenqi

School of Control Science and Engineering, Dalian University of Technology, Dalian, Liaoning 116024, China

Published: 2016-01-01  Online: 2015-12-30

Abstract

When handling classification problems, the Naive Bayes (NB) algorithm usually assumes that the numerical continuous attributes of the training samples follow a normal distribution, and its classification accuracy also depends on the completeness of the training data. Real sampled data rarely satisfy either requirement. To address missing data, the Naive Bayes classifier learns its parameters from the available incomplete data using the Expectation-Maximization (EM) algorithm. To address numerical continuous attributes that are not normally distributed, the distribution density obtained by kernel density estimation, together with a new analytical calculation method, is used to find the maximum posterior probability. Classification experiments on standard data sets verify the effectiveness of these improvements. Finally, the improved algorithm, EM-DNB, is applied to predicting protein purification processes in bioengineering, and the experimental results show that prediction accuracy improves.
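The abstract does not give the EM update equations, so the following is only a generic sketch of how EM can learn parameters from incomplete data. It assumes a single attribute within one class, a Gaussian model for that attribute, and values that are missing at random (marked `None`). The function name `em_gaussian_with_missing` and its parameters are hypothetical and are not taken from the paper.

```python
def em_gaussian_with_missing(values, n_iter=100):
    """EM estimate of (mean, variance) for one attribute; None marks a missing value.

    Illustrative assumption: the attribute is Gaussian within the class.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("need at least one observed value")
    n = len(values)               # total number of samples
    m = n - len(observed)         # number of missing entries
    sum_obs = sum(observed)
    sumsq_obs = sum(v * v for v in observed)

    mu, var = observed[0], 1.0    # arbitrary starting point
    for _ in range(n_iter):
        # E-step: expected sufficient statistics. For a missing value x,
        # E[x] = mu and E[x^2] = mu^2 + var under the current parameters.
        exp_sum = sum_obs + m * mu
        exp_sumsq = sumsq_obs + m * (mu * mu + var)
        # M-step: maximum-likelihood update from the expected statistics.
        mu = exp_sum / n
        var = exp_sumsq / n - mu * mu
    return mu, var


if __name__ == "__main__":
    print(em_gaussian_with_missing([1.0, None, 3.0, 5.0, None]))
```

In this simplified single-attribute setting the iteration converges to the mean and (biased) variance of the observed values alone. EM becomes more informative when missing entries interact with other unknowns, such as unobserved class labels, which this sketch does not cover.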

Key words: Naive Bayes (NB), Expectation-Maximization (EM) algorithm, continuous attributes, kernel density estimation, protein purification
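To illustrate the idea of replacing the normality assumption with a kernel density estimate, here is a minimal sketch of a naive Bayes classifier whose class-conditional densities come from a Gaussian-kernel estimate. It is not the paper's EM-DNB algorithm: the "new analytical calculation method" mentioned in the abstract is not described there and is not reproduced here. The class name `KDENaiveBayes`, the helper `gaussian_kde`, and the fixed `bandwidth` are assumptions made for illustration.

```python
import math
from collections import defaultdict


def gaussian_kde(samples, bandwidth):
    """Return a function x -> Gaussian-kernel density estimate from 1-D samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


class KDENaiveBayes:
    """Naive Bayes with one kernel density estimate per (class, attribute) pair."""

    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth  # assumed fixed; in practice it would be tuned

    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        n_features = len(X[0])
        # Log prior P(c) estimated from class frequencies.
        self.log_prior = {c: math.log(len(rows) / len(y)) for c, rows in by_class.items()}
        # One density estimate per attribute, fitted on that class's samples only.
        self.densities = {
            c: [gaussian_kde([r[j] for r in rows], self.bandwidth) for j in range(n_features)]
            for c, rows in by_class.items()
        }
        return self

    def predict(self, row):
        def log_posterior(c):
            score = self.log_prior[c]
            for x, density in zip(row, self.densities[c]):
                score += math.log(density(x) + 1e-300)  # small constant avoids log(0)
            return score

        # Maximum a posteriori class under the attribute-independence assumption.
        return max(self.log_prior, key=log_posterior)


if __name__ == "__main__":
    X = [[0.0], [0.2], [-0.1], [5.0], [5.2], [4.9]]
    y = ["a", "a", "a", "b", "b", "b"]
    model = KDENaiveBayes().fit(X, y)
    print(model.predict([0.1]), model.predict([5.1]))
```

Working in log space keeps the product of many small densities from underflowing. The bandwidth controls how smooth each estimated density is and would normally be selected from the data rather than fixed.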

LI Zhongbo, YANG Jianhua, LIU Wenqi. Naive Bayes based on data filling and continuous attribute[J]. Computer Engineering and Applications, 2016, 52(1): 133-140.
