Naive Bayes Fusion of Multiple Attribute Weighting and Orthogonal Transformation

doi:10.3778/j.issn.1002-8331.2211-0411

Abstract

The Naive Bayes algorithm ignores correlations among the multi-dimensional attributes of data, which severely limits where it can be applied as a classifier. To address this, an improved Naive Bayes algorithm that fuses multi-class attribute weighting with orthogonal transformation is proposed. First, contribution degree and correlation mutual information are used to quantify the degree of correlation among discrete attributes and among discrete attribute values, yielding their weights. Second, an orthogonal transformation is used to eliminate linear relationships among continuous attributes. Finally, the conditional probabilities of the weighted discrete attributes and of the orthogonally transformed continuous attributes are computed separately, which yields higher classification accuracy and improves the generalization ability of the algorithm. In k-fold cross-validation on public datasets and a campus card dataset, the proposed algorithm achieves accuracy 7.19 to 9.94 percentage points higher, and weighted average F1 6.4 to 11.64 percentage points higher, than five recent improved Naive Bayes algorithms.

Key words: multidimensional mixed attributes, discrete attribute weighting, discrete attribute value weighting, orthogonal transformation, k-fold cross-validation

LIU Haitao, CHEN Chunmei, PANG Zhongxiang, LIANG Zhiqiang, LI Qing. Naive Bayes Fusion of Multiple Attribute Weighting and Orthogonal Transformation[J]. Computer Engineering and Applications, 2023, 59(18): 84-97.
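
The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the weighting formulas, so the sketch assumes that each discrete attribute is weighted by its mutual information with the class (scaled to mean 1), that the orthogonal transformation is a rotation onto the eigenvectors of the covariance matrix of the continuous attributes, and that the rotated attributes are modelled by per-class Gaussians. Attribute value weighting is omitted. The names `WeightedOrthogonalNB`, `mutual_information`, and `k_fold_accuracy` are hypothetical, and the sketch expects at least two continuous attributes.

```python
import numpy as np


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete arrays."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi


class WeightedOrthogonalNB:
    """Illustrative only: MI-weighted discrete part plus rotated Gaussian part."""

    def fit(self, X_disc, X_cont, y):
        self.classes_ = np.unique(y)
        n_disc = X_disc.shape[1]
        # Assumed weighting: mutual information with the class, scaled to mean 1.
        mi = np.array([mutual_information(X_disc[:, j], y) for j in range(n_disc)])
        self.weights_ = mi / mi.mean() if mi.sum() > 0 else np.ones_like(mi)
        # Assumed orthogonal transformation: rotate onto the eigenvectors of the
        # covariance matrix so the transformed attributes are linearly uncorrelated.
        self.mean_ = X_cont.mean(axis=0)
        _, self.rotation_ = np.linalg.eigh(np.cov(X_cont - self.mean_, rowvar=False))
        Z = (X_cont - self.mean_) @ self.rotation_
        values = [np.unique(X_disc[:, j]) for j in range(n_disc)]
        self.log_prior_, self.log_cond_, self.mu_, self.var_ = {}, {}, {}, {}
        for c in self.classes_:
            mask = y == c
            self.log_prior_[c] = np.log(mask.mean())
            self.log_cond_[c] = []
            for j, vals in enumerate(values):
                # Laplace-smoothed conditional probabilities P(value | class).
                denom = mask.sum() + len(vals)
                table = {v: np.log((np.sum(X_disc[mask, j] == v) + 1) / denom)
                         for v in vals}
                table[None] = np.log(1 / denom)  # fallback for unseen values
                self.log_cond_[c].append(table)
            self.mu_[c] = Z[mask].mean(axis=0)
            self.var_[c] = Z[mask].var(axis=0) + 1e-9
        return self

    def predict(self, X_disc, X_cont):
        Z = (X_cont - self.mean_) @ self.rotation_
        preds = []
        for xd, z in zip(X_disc, Z):
            scores = []
            for c in self.classes_:
                s = self.log_prior_[c]
                # Discrete part: each log-probability is scaled by its weight.
                for j, v in enumerate(xd):
                    table = self.log_cond_[c][j]
                    s += self.weights_[j] * table.get(v, table[None])
                # Continuous part: independent Gaussians on the rotated attributes.
                s += np.sum(-0.5 * np.log(2 * np.pi * self.var_[c])
                            - (z - self.mu_[c]) ** 2 / (2 * self.var_[c]))
                scores.append(s)
            preds.append(self.classes_[int(np.argmax(scores))])
        return np.array(preds)


def k_fold_accuracy(X_disc, X_cont, y, k=5, seed=0):
    """Mean accuracy over k shuffled, non-overlapping folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    accs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        model = WeightedOrthogonalNB().fit(X_disc[train], X_cont[train], y[train])
        accs.append(np.mean(model.predict(X_disc[fold], X_cont[fold]) == y[fold]))
    return float(np.mean(accs))
```

Given an integer-coded discrete matrix `X_disc`, a continuous matrix `X_cont`, and labels `y`, calling `k_fold_accuracy(X_disc, X_cont, y, k=5)` returns the mean held-out accuracy. Keeping the two attribute types in separate terms of the log-score mirrors the separate computation of conditional probabilities described in the abstract; the specific weighting and rotation used here are stand-ins and would need to be replaced by the paper's own definitions to reproduce its results.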
