Computer Engineering and Applications ›› 2009, Vol. 45 ›› Issue (5): 151-153. DOI: 10.3778/j.issn.1002-8331.2009.05.044

• Database, Signal and Information Processing •

Approach for mining associative terms in uncategorized English documents set

FU Zhong-kai, QIN Hua

  1. College of Computer Science and Technology, Beijing University of Technology, Beijing 100022, China
  • Received: 2008-01-10 Revised: 2008-04-14 Online: 2009-02-11 Published: 2009-02-11
  • Contact: FU Zhong-kai

Abstract: In fields such as relevance judgment for search engine results, text-to-speech conversion, and speech recognition, accurately analyzing the collocation relationships between words is one of the main research problems. Drawing on the massive amount of information available on the Internet, this paper statistically analyzes a large number of English web pages. From the occurrence frequency of individual words and the co-occurrence frequency of word pairs, it derives empirical conclusions for judging how strongly words are related in uncategorized web pages. Based on these conclusions, it proposes a method and formula, grounded in statistical analysis of the document set, for ranking words by their degree of relatedness, and implements a prototype of a distributed system for mining related English words.

Key words: data mining, web-page classification, association rules, sorting algorithm, text representation
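
The abstract describes ranking related words from single-word frequencies and word-pair co-occurrence frequencies, but it does not state the paper's formula. The sketch below is therefore only an illustration of the general idea, not the authors' method: it substitutes pointwise mutual information (PMI) computed over document-level counts as a stand-in score. The function names (`tokenize`, `count_statistics`, `rank_related`), the `min_pair_count` threshold, and the sample documents are all assumptions introduced for this example.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase a document and return its set of distinct alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def count_statistics(documents):
    """Count how many documents contain each word and each unordered word pair."""
    word_df = Counter()
    pair_df = Counter()
    for doc in documents:
        words = sorted(tokenize(doc))
        word_df.update(words)
        # Sorting first guarantees each pair is stored as (a, b) with a < b.
        pair_df.update(combinations(words, 2))
    return word_df, pair_df


def rank_related(term, documents, min_pair_count=2):
    """Rank words co-occurring with `term` by document-level PMI.

    PMI is an assumed stand-in score; the paper's own formula is not
    given in the abstract. Pairs seen in fewer than `min_pair_count`
    documents are skipped to reduce noise from rare words.
    """
    word_df, pair_df = count_statistics(documents)
    n = len(documents)
    scores = {}
    for (a, b), joint in pair_df.items():
        if term not in (a, b) or joint < min_pair_count:
            continue
        other = b if a == term else a
        # PMI = log(P(a, b) / (P(a) * P(b))), with probabilities
        # estimated as fractions of documents.
        scores[other] = math.log(joint * n / (word_df[a] * word_df[b]))
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs = [
    "search engine ranking",
    "search engine index",
    "engine oil change",
    "search results page",
]
ranked = rank_related("search", docs)
```

With the default threshold, only pairs that co-occur in at least two documents are scored, which reflects one common way to keep rare words from dominating a PMI ranking. A distributed version, as the prototype in the paper is described, would presumably split the counting step across machines and merge the counters, but that design is not detailed in the abstract.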