Computer Engineering and Applications ›› 2007, Vol. 43 ›› Issue (12): 147-149.

• Database and Information Processing •

A comparative study of categorization algorithms for biomedical literature

Maoshu Ni, Jing Zhao, Hongfei Lin

  • Received: 2006-08-02 Online: 2007-04-20 Published: 2007-04-20
  • Contact: Hongfei Lin

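Chinese title: 生物医学文本分类方法比较研究 (A Comparative Study of Biomedical Text Classification Methods)

倪茂树 (Maoshu Ni), 赵晶 (Jing Zhao), 林鸿飞 (Hongfei Lin)

  1. Mould Research Institute, Dalian University of Technology; Department of Computer Science, Dalian University of Technology
  • Corresponding author: 林鸿飞 (Hongfei Lin)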

Abstract: Automatic text classification can greatly help people analyze the large volume of biomedical literature. The results of the TREC 2005 Genomics Track showed that the Support Vector Machine (SVM) has a clear advantage over other models. This paper compares simple vector distance classification with SVM-based classification on the TREC data sets. The results show that, in this domain, simple vector distance classification performs no worse than SVM, and that preprocessing with named entity recognition improves performance.

Key words: Automatic Text Classification, Support Vector Machine, Simple Vector Distance Classification, Named Entity Recognition

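Abstract (Chinese version): Text classification plays an important role in processing the massive volume of biomedical literature. The evaluation results of the TREC (Text Retrieval Conference) 2005 Genomics Track showed that the Support Vector Machine (SVM) has a clear advantage over other models for biomedical text classification. Using the TREC evaluation corpus, this paper compares simple vector distance classification with SVM and also discusses how preprocessing with named entity recognition affects the different algorithms. It concludes that simple vector distance classification performs on par with SVM in this domain and that named entity recognition yields some improvement in the results.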

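Key words (Chinese version): text classification, support vector machine, simple vector distance classification, named entity recognition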
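
The abstract names simple vector distance classification but does not spell out the procedure. A common formulation, assumed here rather than taken from the paper, averages the term vectors of each class into a centroid and assigns a new document to the class whose centroid is most similar by cosine. The sketch below illustrates that idea only: the tokenizer, the raw term-frequency weighting, the function names, and the toy documents are all hypothetical, and it does not reproduce the authors' features, the TREC data, or the SVM baseline.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Lower-case whitespace tokenization into a term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(labelled_docs):
    """Average the term vectors of each class into one centroid per class."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in labelled_docs:
        sums[label].update(tf_vector(text))
        counts[label] += 1
    return {
        label: {t: w / counts[label] for t, w in vec.items()}
        for label, vec in sums.items()
    }


def classify(text, centroids):
    """Return the label whose centroid is closest (highest cosine) to the text."""
    vec = tf_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy training set, invented for illustration (not TREC data).
TOY_DOCS = [
    ("gene expression in mouse embryo", "relevant"),
    ("mouse gene mutation phenotype", "relevant"),
    ("clinical trial of drug dosage", "other"),
    ("patient drug response trial", "other"),
]

centroids = train_centroids(TOY_DOCS)
print(classify("mutation of a mouse gene", centroids))  # -> relevant
```

Under this formulation, a named entity recognition step would act before `tf_vector`, for example by normalizing recognized gene or protein mentions to a shared token, so its effect can be measured without changing the classifier itself.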