Computer Engineering and Applications ›› 2007, Vol. 43 ›› Issue (3): 168-168.

• Database and Information Processing •

A Method of Classification Based On Content And Hierarchical Structure For XML File

  

  • Received:2006-02-08 Revised:1900-01-01 Online:2007-01-21 Published:2007-01-21

An Automatic Classification Method for XML Files Based on Content and Hierarchical Structure (基于内容和分层结构的XML文件自动分类方法)

唐凯   

  1. Institute of Computing Technology, Chinese Academy of Sciences
  • Corresponding author: 唐凯

Abstract: This paper proposes a new classification method for XML files based on their hierarchical structure. Three feature-word clusters are generated separately from the content, the hierarchical structure, and domain knowledge, and all three contribute to the classification result. An experimental system was built to show that the method is effective and feasible.

Key words: feature word, automatic text classification

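Abstract (translated from Chinese): This paper proposes a file classification method built on the inherent hierarchical structure of XML files and compares its experimental results with those of an improved vector space model (VSM) method. Unlike earlier classification methods for XML files, it places more emphasis on the structural information that is specific to XML. First, TF-IDF is applied to the unstructured information in the XML files to produce a general feature set. Next, each level of the XML file is assigned a weight according to its importance, which produces a hierarchical feature set. A knowledge feature set is then produced from domain knowledge. Finally, the three feature sets are combined to classify the XML files. Experimental results show that this method is substantially more accurate than the improved VSM method.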

Key words (translated from Chinese): feature word, automatic file classification
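
The three-feature-set pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the weighting formulas, so the geometric depth weight (`level_decay`), the smoothed IDF, the nearest-centroid cosine scoring, the mixing `weights`, and every function name below are assumptions chosen only to make the idea concrete.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def term_counts(xml_text, level_decay=1.0):
    """Count words in element text; a word at depth d adds level_decay ** d.

    Assumption: tail text and attributes are ignored to keep the sketch short.
    """
    counts = Counter()
    stack = [(ET.fromstring(xml_text), 0)]
    while stack:
        node, depth = stack.pop()
        for word in (node.text or "").lower().split():
            counts[word] += level_decay ** depth
        stack.extend((child, depth + 1) for child in node)
    return counts


def idf_table(docs):
    """Smoothed inverse document frequency over the training documents."""
    df = Counter(word for doc in docs for word in set(term_counts(doc)))
    return {w: math.log((1 + len(docs)) / (1 + n)) + 1.0 for w, n in df.items()}


def vectorize(xml_text, idf, level_decay):
    """TF-IDF vector; level_decay=1.0 ignores structure, <1.0 favours shallow nodes."""
    counts = term_counts(xml_text, level_decay)
    return {w: tf * idf.get(w, 0.0) for w, tf in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Sum of the class vectors (cosine similarity is scale-invariant)."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # Counter.update adds values key by key
    return dict(total)


def classify(xml_text, training, domain_terms, weights=(0.4, 0.4, 0.2), level_decay=0.5):
    """Combine content, hierarchy, and domain-knowledge scores per class.

    training: {label: [xml strings]}; domain_terms: {label: set of words}.
    The weights and level_decay values are illustrative, not from the paper.
    """
    idf = idf_table([d for docs in training.values() for d in docs])
    words = set(term_counts(xml_text))
    scores = {}
    for label, docs in training.items():
        content = cosine(vectorize(xml_text, idf, 1.0),
                         centroid(vectorize(d, idf, 1.0) for d in docs))
        structure = cosine(vectorize(xml_text, idf, level_decay),
                           centroid(vectorize(d, idf, level_decay) for d in docs))
        terms = domain_terms.get(label, set())
        knowledge = len(words & terms) / len(terms) if terms else 0.0
        scores[label] = (weights[0] * content + weights[1] * structure
                         + weights[2] * knowledge)
    return max(scores, key=scores.get), scores


# Toy data, invented for illustration only.
TRAINING = {
    "sports": ["<doc><title>football match</title><body><p>the team won the match</p></body></doc>"],
    "finance": ["<doc><title>stock market</title><body><p>the bank raised rates</p></body></doc>"],
}
DOMAIN = {"sports": {"team", "goal"}, "finance": {"bank", "stock"}}

label, scores = classify(
    "<doc><title>market news</title><body><p>the stock fell</p></body></doc>",
    TRAINING, DOMAIN)
print(label)
```

Setting `level_decay` below 1 makes words in shallow elements such as a title count more than words deep in the tree, which is one plausible reading of "weighting each level by importance"; a per-tag or per-level lookup table would be an equally reasonable choice.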