Computer Engineering and Applications ›› 2018, Vol. 54 ›› Issue (24): 57-60. DOI: 10.3778/j.issn.1002-8331.1808-0400

• Big Data and Cloud Computing •

Similar data detection method based on fusion of information entropy and fuzzy integrated evaluation

CHEN Jian, ZHANG Xiaohong

  1. Jiangxi University of Engineering, Xinyu, Jiangxi 338000, China
  • Online: 2018-12-15 Published: 2018-12-14

Abstract: To address the large volume of redundant data in big data environments, an approximately duplicate data detection method is proposed that is grounded in rough set theory and fuses Shannon information entropy with fuzzy integrated evaluation. First, the attributes of the data set are reduced on the basis of Shannon entropy. Next, fuzzy integrated evaluation is used to obtain an importance weight for each attribute that remains after reduction. Finally, similar data are detected according to the reduced attributes and their weights. Theoretical analysis and experimental comparison show that the method achieves high detection accuracy and efficiency when detecting similar data in structured big data sets.

Key words: information entropy, fuzzy integrated evaluation, similar data, attribute reduction, rough set
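
The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, since the abstract gives no formulas, and every name and number in it is a hypothetical choice made for illustration. It assumes that (a) an attribute is redundant when dropping it leaves the Shannon entropy of the record partition unchanged, (b) the fuzzy evaluation step reduces to a weighted average of membership degrees over importance grades, and (c) two records are compared by exact match on each retained attribute.

```python
from collections import Counter
from math import log2


def entropy(records, attrs):
    """Shannon entropy of the partition of `records` induced by `attrs`."""
    n = len(records)
    blocks = Counter(tuple(r[a] for a in attrs) for r in records)
    return -sum(c / n * log2(c / n) for c in blocks.values())


def reduce_attributes(records, attrs, tol=1e-12):
    """Greedy backward elimination (assumed strategy): drop an attribute
    when the partition entropy stays the same without it."""
    full = entropy(records, attrs)
    kept = list(attrs)
    for a in attrs:
        trial = [x for x in kept if x != a]
        if trial and abs(entropy(records, trial) - full) <= tol:
            kept = trial
    return kept


def fuzzy_weights(membership, grade_scores):
    """Simplified fuzzy evaluation (assumed form): score each attribute by
    the weighted average of its membership degrees, then normalize."""
    raw = {a: sum(m * s for m, s in zip(ms, grade_scores))
           for a, ms in membership.items()}
    total = sum(raw.values())
    return {a: v / total for a, v in raw.items()}


def similarity(r1, r2, weights):
    """Sum of the weights of the attributes on which both records agree."""
    return sum(w for a, w in weights.items() if r1[a] == r2[a])


def detect_similar(records, weights, threshold):
    """Return index pairs whose weighted similarity reaches the threshold."""
    return [(i, j)
            for i in range(len(records))
            for j in range(i + 1, len(records))
            if similarity(records[i], records[j], weights) >= threshold]


# Hypothetical toy data: `city` duplicates `zip`, `country` is constant.
attrs = ["name", "dept", "city", "zip", "country"]
rows = [
    ("Li Wei", "Sales", "Xinyu", "338000", "CN"),
    ("Li Wei", "Marketing", "Xinyu", "338000", "CN"),
    ("Wang Fang", "Sales", "Nanchang", "330000", "CN"),
    ("Zhao Lei", "Sales", "Xinyu", "338000", "CN"),
    ("Wang Fang", "Sales", "Xinyu", "338000", "CN"),
]
records = [dict(zip(attrs, row)) for row in rows]

reduced = reduce_attributes(records, attrs)
print(reduced)  # ['name', 'dept', 'zip']

# Hypothetical membership degrees over the grades (high, medium, low).
membership = {
    "name": (0.8, 0.2, 0.0),
    "dept": (0.1, 0.4, 0.5),
    "zip": (0.5, 0.2, 0.3),
}
weights = fuzzy_weights({a: membership[a] for a in reduced}, (1.0, 0.5, 0.0))

pairs = detect_similar(records, weights, threshold=0.8)
print(pairs)  # [(0, 1)]
```

The greedy elimination depends on attribute order, so a different ordering may keep `city` instead of `zip`. A practical detector would probably also replace the exact-match comparison with a string similarity measure and avoid the quadratic pairwise loop on large data sets.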