Computer Engineering and Applications ›› 2015, Vol. 51 ›› Issue (24): 159-164.

• Database, Data Mining, Machine Learning •

Knowledge reduction algorithm for boundary region partition in cloud computing

CHANG Yuhui1,2, LV Ping1,2, QIAN Jin1,2   

  1. School of Computer Engineering, Jiangsu University of Technology, Changzhou, Jiangsu 213001, China
  2. Key Laboratory of Cloud Computing & Intelligent Information Processing of Changzhou City, Jiangsu University of Technology, Changzhou, Jiangsu 213001, China
  • Online: 2015-12-15 Published: 2015-12-30

Abstract: Knowledge reduction in rough set theory is a critical step of knowledge acquisition in data mining applications. Classical knowledge reduction algorithms assume that a small dataset can be loaded into main memory all at once, while existing parallel knowledge reduction algorithms rely only on task parallelism to speed up reduction; neither approach can handle massive datasets, and massive high-dimensional data makes attribute reduction a challenging task. To address this problem, the classical algorithms are analyzed, the concept of indiscernible object pairs is defined, and knowledge reduction algorithms that preserve the boundary region partition are proposed. The relationships among these algorithms are discussed. The feasibility of combining data parallelism and task parallelism in knowledge reduction is then examined, and a framework model for boundary-region-partition-preserving knowledge reduction in a cloud computing environment is presented. Experiments in a cloud computing environment built on the Hadoop platform show that the proposed algorithms can process massive datasets.

Key words: cloud computing, rough set, knowledge reduction, data parallelism, task parallelism
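
The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea, written under stated assumptions. It uses the standard rough set notion of the boundary region (objects whose equivalence class under the chosen condition attributes contains more than one decision value) and a MapReduce-style split into a per-chunk map step and a merging reduce step. The function names `map_chunk`, `reduce_counts`, and `boundary_size`, the toy decision table, and the use of boundary size as the preservation test are all illustrative assumptions, not the authors' definitions or their Hadoop implementation.

```python
from collections import Counter, defaultdict


def map_chunk(chunk, attrs):
    """Map step (hypothetical): summarize one data split.

    Each row is (condition_values: dict, decision). The result maps the
    tuple of values on `attrs` (an equivalence class key) to a Counter
    of decision values seen in this split.
    """
    local = defaultdict(Counter)
    for cond, decision in chunk:
        key = tuple(cond[a] for a in attrs)
        local[key][decision] += 1
    return local


def reduce_counts(partials):
    """Reduce step (hypothetical): merge per-split summaries by class key."""
    merged = defaultdict(Counter)
    for part in partials:
        for key, counts in part.items():
            merged[key].update(counts)
    return merged


def boundary_size(chunks, attrs):
    """Number of objects in equivalence classes with mixed decisions."""
    merged = reduce_counts(map_chunk(chunk, attrs) for chunk in chunks)
    return sum(sum(c.values()) for c in merged.values() if len(c) > 1)


# Toy decision table (made up for illustration).
rows = [
    ({"a": 0, "b": 0, "c": 0}, "yes"),
    ({"a": 0, "b": 0, "c": 0}, "no"),   # conflicts with the first row
    ({"a": 0, "b": 1, "c": 0}, "yes"),
    ({"a": 1, "b": 1, "c": 1}, "no"),
    ({"a": 1, "b": 0, "c": 1}, "no"),
]
# Two splits; the conflicting rows land in different splits on purpose,
# so the conflict is only visible after the reduce step.
chunks = [rows[0::2], rows[1::2]]

full = ("a", "b", "c")
# Dropping attributes can only coarsen classes, so the boundary region can
# only grow; equal size therefore means the boundary region is unchanged.
for candidate in [("a", "b"), ("a",), ("b", "c")]:
    same = boundary_size(chunks, candidate) == boundary_size(chunks, full)
    print(candidate, "preserves boundary region:", same)
```

In this sketch the data parallelism lies in running `map_chunk` independently on each split, and the candidate subsets in the final loop are independent of one another, so they could in principle be evaluated as parallel tasks.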