Computer Engineering and Applications, 2020, Vol. 56, Issue (6): 66-72. DOI: 10.3778/j.issn.1002-8331.1903-0330


Dimension Reduction and Visualization of Mixed-Type Data Based on E-t-SNE

WEI Shichao, LI Xin, ZHANG Yichi, ZHOU Xiaofeng, LI Shuai   

  1. Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  3. Key Laboratory of Network Control System, Chinese Academy of Sciences, Shenyang 110016, China
  4. Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang 110016, China
  • Online:2020-03-15 Published:2020-03-13

基于E-t-SNE的混合属性数据降维可视化方法

魏世超,李歆,张宜弛,周晓锋,李帅   

  1. 中国科学院 沈阳自动化研究所,沈阳 110016
  2. 中国科学院大学,北京 100049
  3. 中国科学院 网络化控制系统重点实验室,沈阳 110016
  4. 中国科学院 机器人与智能制造创新研究院,沈阳 110016

Abstract:

The traditional t-SNE algorithm can only handle data of a single attribute type and does not handle mixed-type data well. To address this problem, an extended t-SNE dimensionality reduction and visualization algorithm, named E-t-SNE, is proposed for mixed-type data. Information entropy is introduced to construct the distance matrix of the categorical attributes. The distance matrix of the mixed-type data is then constructed by combining this categorical distance with the Euclidean distance of the numerical attributes. The combined matrix is fed into the t-SNE algorithm, which reduces the dimensionality of the data and displays it in two-dimensional space. To verify the effectiveness of the algorithm, the k-nearest neighbor (kNN) algorithm is used to evaluate the reduced data. Experiments on UCI datasets show that the method not only visualizes mixed-type data well, but also effectively separates data of different classes into clusters after dimension reduction and improves the classification accuracy of subsequent classifiers.

Key words: t-SNE algorithm, mixed type data, dimension reduction, visualization
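
The pipeline outlined in the abstract (an entropy-based categorical distance, a Euclidean numerical distance, a combined matrix passed to t-SNE, and a kNN check) can be sketched roughly as follows. The abstract does not give the formulas, so several choices here are assumptions rather than the authors' definitions: the categorical distance is taken to be a mismatch count weighted by each attribute's normalized Shannon entropy, numerical attributes are min-max scaled, and the two matrices are simply added. All function names are hypothetical; only the NumPy and scikit-learn calls are real.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def attribute_entropy(column):
    """Shannon entropy (in bits) of one categorical column."""
    _, counts = np.unique(column, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def categorical_distance(X_cat):
    """Pairwise mismatch distance, weighted by attribute entropy (assumed form)."""
    n, m = X_cat.shape
    weights = np.array([attribute_entropy(X_cat[:, j]) for j in range(m)])
    total = weights.sum()
    # Fall back to uniform weights if every attribute is constant.
    weights = weights / total if total > 0 else np.full(m, 1.0 / m)
    D = np.zeros((n, n))
    for j in range(m):
        col = X_cat[:, j]
        D += weights[j] * (col[:, None] != col[None, :])
    return D


def numerical_distance(X_num):
    """Pairwise Euclidean distance on min-max scaled numerical attributes."""
    X = np.asarray(X_num, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    X = (X - X.min(axis=0)) / span
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def mixed_distance(X_num, X_cat):
    """Combined matrix; plain addition is an assumption, not the paper's rule."""
    return numerical_distance(X_num) + categorical_distance(X_cat)


def embed_and_score(X_num, X_cat, y, perplexity=30.0, k=5, seed=0):
    """Embed with t-SNE on the precomputed matrix, then score with kNN."""
    D = mixed_distance(X_num, X_cat)
    # scikit-learn requires init="random" when metric="precomputed".
    tsne = TSNE(n_components=2, metric="precomputed", init="random",
                perplexity=perplexity, random_state=seed)
    Y = tsne.fit_transform(D)
    knn = KNeighborsClassifier(n_neighbors=k)
    accuracy = cross_val_score(knn, Y, y, cv=5).mean()
    return Y, accuracy
```

Calling `embed_and_score(X_num, X_cat, y)` on a mixed-type dataset returns the two-dimensional embedding and a cross-validated kNN accuracy. The accuracy measures how well the classes separate in the embedding, which is the kind of evaluation the abstract mentions, although the paper's exact protocol may differ.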

摘要:

针对传统的t分布随机近邻嵌入(t-SNE)算法只能处理单一属性数据,不能很好地处理混合属性数据的问题,提出一种扩展的t-SNE降维可视化算法E-t-SNE,用于处理混合属性数据。该方法引入信息熵概念来构建分类属性数据的距离矩阵,采用分类属性数据距离与数值属性数据欧式距离相结合的方式构建混合属性数据距离矩阵,将新的距离矩阵输入t-SNE算法对数据进行降维并在二维空间可视化展示。此外,为验证算法有效性,采用k近邻(kNN)算法对混合数据降维后的效果进行评价。通过在UCI数据集上的实验表明,该方法在处理混合属性数据方面,不仅具有较好的可视化能力,而且能有效地对不同类别的数据进行降维分簇,提升后续分类器的分类准确率。

关键词: t-SNE算法, 混合属性数据, 降维, 可视化