Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (7): 95-100. DOI: 10.3778/j.issn.1002-8331.2003-0432


Parallel Mutual-Information Computation of Categorical Data Based on Spark

LI Junli   

  1. School of Information Technology and Engineering, Jinzhong University, Jinzhong, Shanxi 030619, China
  • Online: 2021-04-01 Published: 2021-04-02





To address the high computational cost of mutual information on large-scale categorical data, this paper proposes PMS, a parallel mutual information computation method for categorical data built on the Spark in-memory computing platform. The algorithm first applies a column-wise transformation that converts the data set into multiple data subsets. PMS then uses two variable-length arrays to cache intermediate results, which addresses the heavy and highly repetitive computation involved in mutual information for categorical data. Finally, PMS is implemented and evaluated on a Spark cluster with 24 computing nodes using synthetic and real data sets. Experimental results show that PMS performs well in terms of efficiency and scalability.
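
The abstract does not give the details of PMS, so the following is only a minimal single-process Python sketch of the general idea, not the paper's Spark implementation. It transposes records into columns, counts each column's values once, and reuses those cached counts for every column pair when evaluating I(X;Y) = Σ p(x,y) log2( p(x,y) / (p(x) p(y)) ). The function names (to_columns, mutual_information, pairwise_mi) and the toy data are hypothetical, and the per-column cache is a stand-in for intermediate-result caching in general rather than the two variable-length arrays the paper describes.

```python
from collections import Counter
from itertools import combinations
from math import log2


def to_columns(rows):
    """Column-wise transformation: turn a list of records into a list of columns."""
    return [list(col) for col in zip(*rows)]


def mutual_information(x, y, count_x, count_y):
    """Mutual information (in bits) of two categorical columns.

    count_x and count_y are precomputed value counts for x and y,
    so they are not recounted for every pair the column takes part in.
    """
    n = len(x)
    joint = Counter(zip(x, y))  # joint value counts for this pair
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[a] * count_y[b]))
    return mi


def pairwise_mi(rows):
    """Mutual information for every pair of columns, keyed by (i, j) with i < j."""
    columns = to_columns(rows)
    # Cache: each column's value counts are computed exactly once.
    counts = [Counter(col) for col in columns]
    return {
        (i, j): mutual_information(columns[i], columns[j], counts[i], counts[j])
        for i, j in combinations(range(len(columns)), 2)
    }


# Toy data (made up): column 0 determines column 1; column 2 is independent of both.
rows = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "p"), ("b", "y", "q")]
mi = pairwise_mi(rows)
```

In this toy data set, columns 0 and 1 carry the same information, so their mutual information is 1 bit, while column 2 is independent of both and yields 0. In a Spark setting, one plausible arrangement (an assumption, not taken from the paper) would distribute the column pairs across workers and share the cached counts with them.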

Key words: column-wise transformation, parallel mutual information computation, categorical data, Spark platform


