计算机工程与应用 ›› 2021, Vol. 57 ›› Issue (2): 97-103.DOI: 10.3778/j.issn.1002-8331.1912-0215

• 网络、通信与安全 • 上一篇    下一篇

基于空间动态划分的差分隐私聚类算法

张可铧,成卫青   

  1. 1.南京邮电大学 计算机学院,南京 210023
    2.东南大学 计算机网络和信息集成教育部重点实验室,南京 211189
  • 出版日期:2021-01-15 发布日期:2021-01-14

Differential Privacy Clustering Algorithm Based on Spatial Dynamic Partition

ZHANG Kehua, CHENG Weiqing   

  1. 1.School of Computer, Nanjing University of Posts & Telecommunications, Nanjing 210023, China
    2.Key Laboratory of Computer Network & Information Integration of Ministry of Education, Southeast University, Nanjing 211189, China
  • Online:2021-01-15 Published:2021-01-14

摘要:

差分隐私算法作为当前研究较多的隐私保护机制之一,有着广泛应用。目前有多种基于差分隐私保护的[k]均值聚类算法,应用场景不一,各有缺陷。以往的算法通过均等划分数据集,构造等宽直方图进行聚类,这会导致没有数据分布的区域也被无差别插入噪声,影响聚类性能。针对这一点,提出了一种新的差分隐私聚类算法[DPQTk]-means,先通过构建差分隐私四分树,用大小不一的自适应存储桶动态划分数据空间,充分表示数据集同时减少噪声插入,再进行[k]均值聚类,证明了其满足[ε]-差分隐私保护。实验结果表明,[DPQTk]-means算法与以往的差分隐私聚类算法相比具有更好的聚类可用性,且能够在隐私保护水平较高的同时保持稳定的聚类性能。

关键词: 差分隐私, 四分树, 动态划分, [k]均值

Abstract:

As one of the most popular privacy protection mechanisms, the differential privacy algorithm has been widely used. At present, there are a variety of [k]-means clustering algorithms based on differential privacy protection. The application scenarios are different and each has its own defects. There is an algorithm that divides the data set and constructs the equal-width histograms for clustering. This causes the areas without data to be inserted into noise without any difference, which affects clustering performance. To solve this problem, a new differential privacy clustering algorithm [DPQTk]-means is proposed. By constructing a differential privacy quad tree, the data space is dynamically divided by adaptive buckets of different sizes to fully represent the data set and reduce noise insertion, and then do [k]-means clustering. It proves that it satisfies [ε]-differential privacy protection. Experimental results show that the [DPQTk]-means algorithm has better cluster availability than the previous differential privacy clustering algorithms, and can maintain stable clustering performance while maintaining a high level of privacy protection.

Key words: differential privacy, quad tree, dynamic partition, [k]-means