Computer Engineering and Applications ›› 2007, Vol. 43 ›› Issue (12): 147-149.

• Database and Information Processing •

A comparison study on categorization algorithms for biomedical literature

Maoshu Ni, Jing Zhao, Hongfei Lin

  1. Mould Research Institute, Dalian University of Technology; Department of Computer Science, Dalian University of Technology
  • Received: 2006-08-02 Revised: 1900-01-01 Online: 2007-04-20 Published: 2007-04-20
  • Contact: Hongfei Lin

Abstract: Automatic text classification plays an important role in processing the massive volume of biomedical literature. The evaluation results of the TREC (Text Retrieval Conference) 2005 genomics track showed that the Support Vector Machine (SVM) has a clear advantage over other models on biomedical text classification. Using the TREC evaluation corpus, this paper compares simple vector distance classification with SVM and discusses how preprocessing with named entity recognition affects each algorithm. The results show that simple vector distance classification performs on par with SVM in this domain, and that named entity recognition preprocessing improves the results to some extent.

Key words: automatic text classification, Support Vector Machine, simple vector distance classification, named entity recognition
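
To make the idea of classification by simple vector distance concrete, the following is a minimal sketch, not the authors' implementation. It assumes raw term-frequency vectors, one averaged centroid per class, and cosine similarity as the closeness measure; the abstract does not state the weighting scheme or distance function actually used. The function names and the toy training sentences are hypothetical and are not drawn from the TREC corpus.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Turn a text into a sparse term-frequency vector (whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def train_centroids(labeled_docs):
    """Average the vectors of each class into one centroid per class."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in labeled_docs:
        sums[label].update(vectorize(text))
        counts[label] += 1
    return {
        label: {term: w / counts[label] for term, w in vec.items()}
        for label, vec in sums.items()
    }


def classify(text, centroids):
    """Assign the class whose centroid has the highest cosine similarity."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy, made-up training data (an assumption for illustration only).
TRAIN = [
    ("gene expression in mouse embryo tissue", "relevant"),
    ("mutant allele phenotype in mouse gene", "relevant"),
    ("hospital staffing survey and management costs", "irrelevant"),
    ("health insurance policy costs survey", "irrelevant"),
]

centroids = train_centroids(TRAIN)
print(classify("mouse gene mutant expression", centroids))
print(classify("survey of insurance costs", centroids))
```

In a sketch like this, a named entity recognition step would plausibly run before `vectorize`, for example by merging a multi-word gene name into a single token, so that both this classifier and an SVM see the same preprocessed features.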