Computer Engineering and Applications, 2017, Vol. 53, Issue (2): 53-57. DOI: 10.3778/j.issn.1002-8331.1503-0340
LI Hongqi, FENG Haibo, ZHANG Wei, YANG Zhongguo, SONG Weicheng
李洪奇,冯海波,张 伟,杨中国,宋伟城
Abstract: Web page deduplication based on MD5 digital fingerprints is simple and efficient, and it has been widely applied. However, MD5 is a strict hash algorithm: even a small difference in the input produces a completely different result, so this approach cannot achieve high recall. This paper presents two improved algorithms. They map web page content into a character-set space and compute the distance between the two resulting spaces; the similarity of two texts is then determined by that distance. The methods tolerate small differences between two pages. Their time complexity is O(n) and their space complexity is O(1), which makes them suitable for large-scale deduplication of web pages.
Key words: character-set vector, machine-code vector, web page deduplication, digital fingerprint, MD5
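Abstract (translated from the Chinese): Web page deduplication algorithms that compute digital fingerprints with MD5 are simple and efficient, and are widely used in the field. However, because MD5 is a strict cryptographic hash, the fingerprint changes completely even when the article content changes only slightly, so deduplication based on it has low recall. Two improved deduplication algorithms based on character-set feature vectors are proposed. They map article content into a character-set space and judge whether articles are similar by computing the distance in that space. The proposed algorithms generalize well: reordering within a paragraph, or adding, deleting, or changing individual characters, does not prevent similar paragraphs from being recognized, which greatly improves recall. Experimental results show that the algorithms have time complexity O(n) and space complexity O(1), making them suitable for large-scale web page deduplication.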
Key words (translated from the Chinese): character-set vector, machine-code vector, web page deduplication, digital fingerprint, MD5
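The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the vectors or the distance are defined, so the fixed vector size `DIM`, the bucketing by code point, the L1-based `similarity` score, and all function names here are assumptions made for illustration only.

```python
import hashlib

# Assumed fixed vector size: memory use stays constant (O(1))
# no matter how long the text is.
DIM = 256


def md5_fingerprint(text: str) -> str:
    """Exact-match fingerprint: any change in the text changes the digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def char_vector(text: str, dim: int = DIM) -> list[int]:
    """Count characters into a fixed number of buckets in one O(n) pass.

    Bucketing by code point modulo `dim` is an illustrative choice,
    not a detail taken from the paper.
    """
    vec = [0] * dim
    for ch in text:
        if not ch.isspace():
            vec[ord(ch) % dim] += 1
    return vec


def similarity(a: str, b: str, dim: int = DIM) -> float:
    """Return 1.0 for identical character counts, 0.0 for disjoint ones."""
    va, vb = char_vector(a, dim), char_vector(b, dim)
    total = sum(va) + sum(vb)
    if total == 0:
        return 1.0  # two empty texts are treated as identical
    l1_distance = sum(abs(x - y) for x, y in zip(va, vb))
    return 1.0 - l1_distance / total


original = "the quick brown fox jumps over the lazy dog"
reordered = "over the lazy dog the quick brown fox jumps"
edited = "the quick brown fox jumped over the lazy dog"

# MD5 treats the edited text as a completely different document...
print(md5_fingerprint(original) == md5_fingerprint(edited))
# ...while the character-count vectors remain close or identical.
print(similarity(original, reordered))
print(similarity(original, edited))
```

Because the vector ignores character order, a reordered text scores as identical, and a one-word edit lowers the score only slightly; a deduplication pipeline would compare the score against a threshold, whose value would have to be tuned on real data.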
LI Hongqi, FENG Haibo, ZHANG Wei, YANG Zhongguo, SONG Weicheng. Improved eliminating duplicated web pages based on feature vector of character set[J]. Computer Engineering and Applications, 2017, 53(2): 53-57.
李洪奇,冯海波,张 伟,杨中国,宋伟城. 基于字集特征向量的网页消重改进算法[J]. 计算机工程与应用, 2017, 53(2): 53-57.
URL: http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.1503-0340
http://cea.ceaj.org/EN/Y2017/V53/I2/53