Parallel Mutual-Information Computation of Categorical Data Based on Spark

doi:10.3778/j.issn.1002-8331.2003-0432

Computer Engineering and Applications, 2021, Vol. 57, Issue (7): 95-100. DOI: 10.3778/j.issn.1002-8331.2003-0432

LI Junli

School of Information Technology and Engineering, Jinzhong University, Jinzhong, Shanxi 030619, China

Online: 2021-04-01; Published: 2021-04-02

Abstract:

To address the heavy computational cost of mutual information on large-scale categorical data, this paper proposes PMS, a parallel mutual-information computation method for categorical data built on the Spark in-memory computing platform. The algorithm first applies a column-wise transformation that converts the data set into multiple data subsets. It then uses two variable-length arrays to cache intermediate results, which addresses the large and highly repetitive computation involved in calculating mutual information between pairs of categorical features. Finally, PMS is implemented and evaluated on a Spark cluster with 24 computing nodes, using both synthetic and real data sets. The experimental results show that PMS achieves high performance in terms of efficiency, scalability, and extensibility.

Key words: column-wise transformation, parallel mutual-information computation, categorical data, Spark platform
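
As a rough illustration of the steps named in the abstract, the single-machine Python sketch below turns row records into columns, caches per-column marginal counts once, and reuses them when computing mutual information for every feature pair. It is not the PMS implementation and does not use Spark. The function names (`column_transform`, `mutual_information`, `pairwise_mi`) are hypothetical, and caching marginal counts is an assumption, since the abstract does not state what the two variable-length arrays hold. Mutual information is computed in nats as I(X;Y) = sum over (x, y) of p(x,y) log(p(x,y) / (p(x) p(y))).

```python
import math
from collections import Counter
from itertools import combinations


def column_transform(rows):
    """Turn row-oriented records into one list of values per feature (column)."""
    return [list(col) for col in zip(*rows)]


def mutual_information(col_x, col_y, counts_x, counts_y):
    """Mutual information (in nats) of two equal-length categorical columns.

    counts_x and counts_y are the cached marginal value counts of each column.
    """
    n = len(col_x)
    joint = Counter(zip(col_x, col_y))
    mi = 0.0
    for (x, y), n_xy in joint.items():
        # p(x,y) * log(p(x,y) / (p(x) * p(y))), expressed with raw counts
        mi += (n_xy / n) * math.log(n_xy * n / (counts_x[x] * counts_y[y]))
    return mi


def pairwise_mi(rows):
    """Mutual information for every pair of features, keyed by column indices."""
    columns = column_transform(rows)
    # Assumed cache: marginal counts are computed once per column and reused
    # by every pair that involves that column. Each Counter has a different
    # length because columns have different numbers of distinct values.
    marginals = [Counter(col) for col in columns]
    return {
        (i, j): mutual_information(columns[i], columns[j], marginals[i], marginals[j])
        for i, j in combinations(range(len(columns)), 2)
    }


# Toy data: features 0 and 1 always change together, feature 2 is independent.
rows = [
    ("a", "x", "p"),
    ("a", "x", "q"),
    ("b", "y", "p"),
    ("b", "y", "q"),
]
result = pairwise_mi(rows)
```

On this toy data, features 0 and 1 are perfectly dependent, so their mutual information equals ln 2, while feature 2 is independent of both and gives 0. In a distributed setting, the per-pair computations are independent of one another and could be assigned to separate tasks, though how PMS partitions the work is not described in the abstract.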

LI Junli. Parallel Mutual-Information Computation of Categorical Data Based on Spark[J]. Computer Engineering and Applications, 2021, 57(7): 95-100.
