Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (2): 53-57. DOI: 10.3778/j.issn.1002-8331.1503-0340


Improved eliminating duplicated web pages based on feature vector of character set

LI Hongqi, FENG Haibo, ZHANG Wei, YANG Zhongguo, SONG Weicheng   

  1. College of Computer, China Petroleum University, Beijing 102200, China
  • Online: 2017-01-15 Published: 2017-05-11

基于字集特征向量的网页消重改进算法

李洪奇,冯海波,张  伟,杨中国,宋伟城   

  1. 中国石油大学(北京) 计算机系,北京 102200

Abstract: Web page deduplication based on MD5 digital fingerprints is simple and efficient, and has been applied in many fields. However, MD5 is a strict hash: a small difference in the input produces a completely different result, so this method cannot achieve a high recall rate. This paper presents two improved algorithms. They map web page content into a character-set space and compute the distance between the two resulting vectors; the similarity of two texts is then determined by that distance. The methods can therefore tolerate small differences between two pages. Their time complexity is O(n) and their space complexity is O(1), which makes them suitable for eliminating duplicated web pages at large scale.

Key words: alphabet vector, machine code vector, webpage duplicate removal, digital fingerprint, MD5

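Abstract (translated from the Chinese): Web page deduplication that computes digital fingerprints with the MD5 algorithm is simple and efficient, and is widely used in the field. However, because MD5 is a strict message-digest algorithm, even a very small change to an article's content produces a completely different fingerprint, so deduplication built on it does not achieve high recall. This paper proposes two improved web page deduplication algorithms based on character-set feature vectors: article content is mapped into a character-set space, and the distance in that space is used to judge whether two articles are similar. The proposed algorithms generalize well: reordered wording and the insertion, deletion, or modification of individual characters within a paragraph do not prevent similar paragraphs from being recognized, which greatly improves recall. Experimental results show that the algorithms have time complexity O(n) and space complexity O(1), making them suitable for large-scale web page deduplication.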

Key words (translated from the Chinese): character-set vector, machine-code vector, web page deduplication, digital fingerprint, MD5
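
The contrast the abstract draws between an exact MD5 fingerprint and a distance in character-set space can be illustrated with a minimal Python sketch. The abstract does not specify the feature vector, the distance measure, or any threshold, so everything here is an assumption made for illustration: the vector is a fixed-size table of character counts (`NUM_BUCKETS` is hypothetical), the distance is a normalized L1 distance, and the 0.1 threshold is arbitrary. It is not the authors' algorithm, only one way to obtain a single O(n) pass with constant-size state.

```python
import hashlib

# Assumed fixed vector size: memory stays constant regardless of page length.
NUM_BUCKETS = 256


def md5_fingerprint(text):
    """Exact fingerprint: any change to the text changes the digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def char_vector(text, num_buckets=NUM_BUCKETS):
    """Count non-whitespace characters into a fixed number of buckets.

    One pass over the text (O(n) time) into a fixed-size list (O(1) space).
    Bucketing by code point modulo num_buckets is an illustrative choice.
    """
    vec = [0] * num_buckets
    for ch in text:
        if ch.isspace():
            continue
        vec[ord(ch) % num_buckets] += 1
    return vec


def distance(a, b):
    """Normalized L1 distance in [0, 1]; 0 means identical character counts."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def is_near_duplicate(t1, t2, threshold=0.1):
    """Hypothetical decision rule: small distance means near-duplicate."""
    return distance(char_vector(t1), char_vector(t2)) <= threshold


original = "the quick brown fox jumps over the lazy dog"
reordered = "over the lazy dog the quick brown fox jumps"
edited = "the quick brown fox jumped over the lazy dog"
unrelated = "completely different content about petroleum engineering"

# MD5 treats the reordered text as a different document.
print(md5_fingerprint(original) == md5_fingerprint(reordered))
# The character-count vectors still match the reordered and edited variants.
print(is_near_duplicate(original, reordered))
print(is_near_duplicate(original, edited))
print(is_near_duplicate(original, unrelated))
```

Because the vector ignores character order, reordering leaves the distance at zero, and changing a few characters moves it only slightly. The same property means two unrelated texts with similar character statistics could collide, so a real system would need a more discriminating feature vector than this sketch uses.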