Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (10): 141-146.

Improved context graph algorithm by using feature selection based on word frequency differentia

ZHANG Yong, WU Chongzheng   

  1. School of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China
  • Online: 2014-05-15 Published: 2014-05-14

Improvement of the Context Graph algorithm based on feature selection by word frequency difference

张  永,吴崇正   

  1. 兰州理工大学 计算机与通信学院,兰州 730050

Abstract: To address the low efficiency of traditional focused crawlers, this paper analyzes the heuristic crawler search algorithm Context Graph, identifies its deficiencies, and proposes an optimization strategy. The strategy combines an improved TF-IDF formula with a feature selection method based on word frequency differences, and it takes into account the differing importance of text from different parts of a web page. A new term weighting method for text categorization is also described, which weights each feature word both between classes and within a class. Experimental results indicate that the strategy crawls topic-relevant pages more efficiently than the other algorithms used for comparison.

Key words: focused crawler, Context Graph, search strategy, feature selection, TF-IDF

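Abstract: To address the low efficiency of traditional focused crawlers, an improved Context Graph crawler search strategy is proposed, building on an analysis of the heuristic web crawler search algorithm Context Graph. The strategy improves the original algorithm with a feature selection method based on word frequency difference and an improved TF-IDF formula. It jointly considers the influence of text from different parts of a web page on feature selection, as well as the inter-class and intra-class weights of feature words, in order to raise the quality of feature selection and evaluation. Experimental results show that the improved strategy is more efficient than the established traditional methods used as controls.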

Key words: focused crawler, Context Graph model, search strategy, feature selection, TF-IDF