Unsupervised Semantic Segmentation Based on Object-Aware Semantic Cues

doi:10.3778/j.issn.1002-8331.2504-0324

Abstract

Abstract: Traditional semantic segmentation relies on extensive pixel-level annotations, which has motivated growing interest in unsupervised approaches. Recent work leverages the deep features of self-supervised vision Transformers and has advanced unsupervised semantic segmentation (USS). However, local feature encoding lacks explicit object-level semantic representations, so objects with intricate structures are often segmented poorly. To address this limitation, a USS framework named OASES (object-aware segmentation system) is proposed to strengthen object-centric representation learning. The method incorporates a spectral analysis process that obtains semantic and structural cues by analyzing eigenvalues extracted from the semantic similarity matrix of deep image features and from the color affinity of the image. In addition, an object-centric contrastive loss guides the model to learn object-level semantic representations that remain consistent both within and across images, thereby improving segmentation accuracy. Extensive experiments on the COCO-Stuff and Cityscapes datasets show that OASES produces accurate and consistent segmentation in complex scenes and achieves state-of-the-art USS performance.

Key words: unsupervised semantic segmentation (USS), object-level semantic cues, spectral analysis, contrastive learning
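
The spectral analysis step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `spectral_cues`, the clipped cosine similarity, the Gaussian color kernel, and the parameters `color_weight` and `sigma` are all assumptions chosen for illustration, following standard spectral clustering on a normalized graph Laplacian.

```python
import numpy as np

def spectral_cues(features, colors, k=4, color_weight=0.5, sigma=0.1):
    """Hypothetical helper: eigen-decompose a graph built from patch features and colors.

    features: (N, D) array of deep patch features (e.g. from a self-supervised ViT).
    colors:   (N, 3) array of mean patch colors scaled to [0, 1].
    Returns the k smallest eigenvalues and their eigenvectors.
    """
    # Semantic affinity (assumed form): cosine similarity with negatives clipped to zero.
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    w_sem = np.clip(f @ f.T, 0.0, None)

    # Color affinity (assumed form): Gaussian kernel on squared color distance.
    d2 = ((colors[:, None, :] - colors[None, :, :]) ** 2).sum(axis=-1)
    w_col = np.exp(-d2 / (2.0 * sigma ** 2))

    # Combined affinity; the weighting scheme is an assumption.
    w = w_sem + color_weight * w_col

    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    deg = np.maximum(w.sum(axis=1), 1e-12)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    lap = np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(lap)
    return eigvals[:k], eigvecs[:, :k]
```

In this sketch the smallest eigenvalue is close to zero, and the sign pattern of the second eigenvector gives a coarse two-way grouping of patches, which is one way such eigen-analysis can expose object-level structure.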

HE Qixiang, GUO Hongyu, CHEN Qizhi, LIU Yulong. Unsupervised Semantic Segmentation Based on Object-Aware Semantic Cues[J]. Computer Engineering and Applications, 2025, 61(20): 218-227.
