Computer Engineering and Applications, 2020, Vol. 56, Issue (12): 169-174. DOI: 10.3778/j.issn.1002-8331.1904-0072

• Pattern Recognition and Artificial Intelligence •

Multi-attribute Filtering Deep Feature Synthesis Algorithm

WANG Like, CUI Xiaoli, ZHANG Lige   

  1. Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu 610041, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  3. Sichuan Rainbow Consulting & Software Co., Ltd., Chengdu 610041, China
  • Online: 2020-06-15 Published: 2020-06-09

Abstract:

Traditional feature engineering relies entirely on manual work to extract features from relational entities, which is tedious, time-consuming, and error-prone. The deep feature synthesis algorithm can synthesize a large number of features for structured data and thereby automate feature engineering over relational entities. To address the severe redundancy of its synthesized features and the difficulty of screening them, an attribute filtering algorithm that combines Kullback-Leibler (KL) divergence and Hellinger distance is proposed. Entities are mapped and joined to the labels, the importance of each attribute in an entity is measured, and the attributes are filtered multiple times so that attributes of low importance are excluded from deep feature synthesis, yielding an optimized synthesis result. Three public datasets of different types are used for experimental verification with different machine learning algorithms. The results show that the improved method significantly reduces the running time of the algorithm and the size of the synthesized data, and effectively improves the quality of the synthesized features and the final prediction accuracy.
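The filtering idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the two measures are combined, so the sketch assumes binary labels, histogram-based class-conditional distributions, and a rule that keeps an attribute only if it passes both a KL threshold and a Hellinger threshold. The function names, the bin count, the smoothing, and both threshold values are hypothetical choices made for illustration.

```python
import numpy as np

def class_histograms(values, labels, bins=10):
    """Histogram one attribute separately for label 0 and label 1 on shared bins."""
    edges = np.histogram_bin_edges(values, bins=bins)
    p, _ = np.histogram(values[labels == 0], bins=edges)
    q, _ = np.histogram(values[labels == 1], bins=edges)
    # Assumption: add-0.5 smoothing so empty bins do not produce log(0).
    p = (p + 0.5) / (p + 0.5).sum()
    q = (q + 0.5) / (q + 0.5).sum()
    return p, q

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def hellinger_distance(p, q):
    """H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), bounded in [0, 1]."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def filter_attributes(table, labels, kl_min=0.1, hellinger_min=0.2):
    """Keep attributes whose class-conditional distributions differ on both measures.

    Assumption: requiring both thresholds stands in for the "multiple filtering"
    named in the abstract; the thresholds themselves are illustrative.
    """
    labels = np.asarray(labels)
    kept = []
    for name, values in table.items():
        p, q = class_histograms(np.asarray(values, dtype=float), labels)
        if kl_divergence(p, q) >= kl_min and hellinger_distance(p, q) >= hellinger_min:
            kept.append(name)
    return kept

# Synthetic data: one attribute shifted by the label, one independent of it.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)
table = {
    "informative": rng.normal(loc=labels * 2.0, scale=1.0),
    "noise": rng.normal(size=2000),
}
print(filter_attributes(table, labels))
```

In a pipeline of this kind, only the attributes returned by the filter would be passed on to deep feature synthesis, which is how filtering before synthesis can shrink both the running time and the number of synthesized features.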

Key words: deep feature synthesis, multiple attribute filtering, Kullback-Leibler (KL) divergence, Hellinger distance