FD-means Clustering Cleaning Algorithm for Near-Duplicate Videos

doi:10.3778/j.issn.1002-8331.2007-0515

Abstract

In recent years, as the scale of video data has grown, near-duplicate videos have continued to emerge, and video data quality problems have become increasingly prominent. Cleaning near-duplicate videos can improve the data quality of a video collection. However, near-duplicate video cleaning has received little study; existing work focuses mainly on near-duplicate video detection. Although existing methods can effectively identify near-duplicate videos, they struggle to clean them automatically and improve video data quality while preserving data integrity. To address this problem, this paper proposes a near-duplicate video cleaning method that combines the VGG-16 deep network with FD-means (feature distance-means) clustering. The method first uses the MOG2 model and a median filter to perform background segmentation and foreground denoising. It then uses the VGG-16 deep network model to extract deep spatial features from the videos. Finally, it constructs a new FD-means clustering algorithm that updates the cluster center points from the iteratively generated near-duplicate video clusters and then deletes every near-duplicate video in each cluster other than the center point. Experimental results show that the proposed method can effectively clean near-duplicate videos automatically and improve video data quality.

Key words: video data quality, near-duplicate videos, video cleaning, VGG-16 deep network, FD-means (feature distance-means) clustering
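
The final cleaning step described in the abstract (cluster the videos, keep the one at each cluster center, delete the rest) can be illustrated with a minimal sketch. The abstract does not give the details of FD-means, so the sketch below substitutes plain k-means with Euclidean distance; the function names, the first-k initialization, and the tiny 2-D feature vectors are all assumptions made for illustration. In the proposed method the features would instead be VGG-16 descriptors extracted after MOG2 background segmentation and median filtering.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(features, k, iterations=20):
    """Plain k-means, used here only as a stand-in for FD-means.

    Assumption: the first k feature vectors serve as initial centers.
    """
    centers = [list(f) for f in features[:k]]
    labels = [0] * len(features)
    for _ in range(iterations):
        # Assignment step: each video joins its nearest center.
        labels = [
            min(range(k), key=lambda j: distance(f, centers[j]))
            for f in features
        ]
        # Update step: recompute each center from its current members.
        for j in range(k):
            members = [f for f, lab in zip(features, labels) if lab == j]
            if members:
                centers[j] = mean_vector(members)
    return labels, centers


def clean_near_duplicates(features, k):
    """Return the indices of the videos to keep: one representative per cluster.

    The representative is the member closest to the cluster center;
    every other member is treated as a near-duplicate to delete.
    """
    labels, centers = kmeans(features, k)
    keep = []
    for j in range(k):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:
            keep.append(
                min(members, key=lambda i: distance(features[i], centers[j]))
            )
    return sorted(keep)


# Hypothetical features: two groups of three near-duplicate videos each.
features = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.2, 5.0], [5.0, 5.1],
]
print(clean_near_duplicates(features, 2))  # → [0, 3]
```

Keeping the member nearest the center, rather than the center itself, means the retained item is always a real video from the collection, which is one way to read the requirement that cleaning preserve data integrity.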

FU Yan, HAN Ze, YE Ou. FD-means Clustering Cleaning Algorithm for Near-Duplicate Videos[J]. Computer Engineering and Applications, 2022, 58(1): 197-203.
