Computer Engineering and Applications, 2024, Vol. 60, Issue (4): 211-219. DOI: 10.3778/j.issn.1002-8331.2209-0338

• Graphics and Image Processing •

Vectorized Feature Space Embedded Clustering Based on Contrastive Learning

ZHENG Yang, WU Yongming, XU An

  1. State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
  2. Key Laboratory of Advanced Manufacturing Technology of Ministry of Education, Guizhou University, Guiyang 550025, China
  • Online: 2024-02-15 Published: 2024-02-15

Abstract: The deep embedding clustering (DEC) algorithm relies solely on an autoencoder, which embeds data into a low-dimensional vectorized feature space through single-instance reconstruction before clustering. It ignores the relationships between different instances, so instances in the embedding space may not be well separated. To address this problem, a vectorized feature space embedded clustering method based on contrastive learning (VECCL) is proposed. Contrastive learning, which identifies the similarities and differences between data instances, is used to extract features with clustering semantics in which similar instances lie close together and dissimilar instances lie far apart. These features are brought into DEC as prior knowledge to guide the autoencoder in initializing a low-dimensional clustering feature space that carries deep data information. In addition, an entropy loss constructed from the soft classification labels is introduced into the clustering loss function, together with the reconstruction loss of the autoencoder, as regularization terms that jointly refine the clustering. Experimental results show that the proposed method has a stronger feature extraction ability. Compared with DEC on the CIFAR10, CIFAR100 and STL10 datasets, ACC increases by 48.1, 23.1 and 41.8 percentage points, NMI increases by 41.0, 25.2 and 39.0 percentage points, and ARI increases by 45.4, 16.4 and 41.8 percentage points, respectively.

Key words: deep clustering, contrastive learning, autoencoder, vectorized feature space, embedding clustering
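
Illustrative sketch: the abstract describes a clustering loss regularized by an entropy term built from the soft labels and by the autoencoder reconstruction loss, but does not give the formulas. The NumPy sketch below shows one way such an objective could be assembled. The soft assignment and target distribution follow the standard DEC formulation (Student's t kernel). The exact form of the entropy term, the weights `lam_ent` and `lam_rec`, and the function name `regularized_clustering_loss` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def soft_assign(z, centers, alpha=1.0):
    """Soft labels q_ij from a Student's t kernel, as in standard DEC."""
    # Squared Euclidean distance between every embedding and every center.
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened DEC target p_ij, normalized by soft cluster frequency."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def regularized_clustering_loss(x, x_rec, z, centers,
                                lam_ent=0.1, lam_rec=1.0, eps=1e-12):
    """KL clustering loss plus entropy and reconstruction regularizers.

    Assumptions (not stated in the abstract): the entropy term is the
    mean per-sample entropy of the soft labels, the reconstruction
    term is mean squared error, and both enter with scalar weights.
    """
    n = len(z)
    q = soft_assign(z, centers)
    p = target_distribution(q)
    kl = np.sum(p * np.log((p + eps) / (q + eps))) / n   # DEC clustering loss
    ent = -np.sum(q * np.log(q + eps)) / n               # assumed entropy loss
    rec = np.mean((x - x_rec) ** 2)                      # autoencoder loss
    return kl + lam_ent * ent + lam_rec * rec, q
```

In this sketch, `z` stands for the embeddings produced by the encoder (initialized, per the abstract, from contrastively learned features) and `x_rec` for the decoder output. Minimizing the per-sample entropy pushes each soft label toward a confident assignment, while the reconstruction term keeps the embedding tied to the input data.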