Computer Engineering and Applications, 2021, Vol. 57, Issue (17): 88-95. DOI: 10.3778/j.issn.1002-8331.2105-0189

• Theory, Research and Development •

Multi-source Outlier Detection Algorithm Based on Relevant Subspace

MA Yang, ZHAO Xujun   

  1. School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China
  • Online: 2021-09-01 Published: 2021-08-30

Abstract:

Most traditional outlier detection methods operate on a single dataset, or on a single dataset obtained by fusing multiple data sources. Their results therefore ignore both the association knowledge among multi-source datasets and key information within individual data sources. To detect outlier association knowledge across multi-source datasets, this paper proposes RSMOD, a Multi-source Outlier Detection algorithm based on Relevant Subspace. First, an object influence space for multi-source data is defined by combining the bidirectional influence of the k-nearest-neighbor set and the reverse-nearest-neighbor set, which improves the accuracy of outlier measurement. Second, building on the influence space, a sparse factor and a sparse difference factor for multi-source data are introduced to characterize how sparse a data object is within multi-source data. Third, the measure of relevant subspace is redefined so that it applies to multi-source datasets, and an outlier detection algorithm based on relevant subspace is given. Finally, experiments on synthetic datasets and real US census datasets verify the performance of RSMOD and analyze the outlier association knowledge derived from multiple datasets.

Key words: outlier detection, multi-source data, subspace, data mining, sparse factor
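
The influence space mentioned in the abstract combines each object's k-nearest-neighbor set with its reverse-nearest-neighbor set. The following is a minimal single-source sketch of that idea, not the RSMOD definition itself: it assumes Euclidean distance and assumes the influence space is the plain union of the two sets, and the function names `knn_sets`, `reverse_sets`, and `influence_spaces` are hypothetical. The multi-source extension, the sparse factor, and the sparse difference factor are defined in the paper and are not reproduced here.

```python
import math


def knn_sets(points, k):
    """For each point index, return the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Sort by Euclidean distance; ties are broken by index for determinism.
        others.sort(key=lambda j: (math.dist(p, points[j]), j))
        neighbors.append(set(others[:k]))
    return neighbors


def reverse_sets(knn):
    """For each point index j, return the set of points that list j as a neighbor."""
    reverse = [set() for _ in knn]
    for i, neighbors in enumerate(knn):
        for j in neighbors:
            reverse[j].add(i)
    return reverse


def influence_spaces(points, k):
    """Assumed definition: influence space = k-NN set united with reverse-NN set."""
    knn = knn_sets(points, k)
    rnn = reverse_sets(knn)
    return [knn[i] | rnn[i] for i in range(len(points))]


# Toy data: four clustered points and one isolated point (index 4).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 6)]
knn = knn_sets(points, 2)
rnn = reverse_sets(knn)
spaces = influence_spaces(points, 2)
```

In this toy data the isolated point at index 4 appears in no other point's neighbor list, so its reverse-nearest-neighbor set is empty while the clustered points all have non-empty ones. That asymmetry is the kind of bidirectional information a k-nearest-neighbor set alone would miss.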