Computer Engineering and Applications, 2007, Vol. 43, Issue 15: 170-172.
• Database and Information Processing •
HAO Chun-feng, WANG Zhong-min
郝春风, 王忠民
Abstract: With the rapid development of network and information technology, text categorization has become a key technique for processing and organizing large collections of text. How to represent a text accurately as data that a classifier can process is a key problem that seriously limits improvements in categorization performance. Building on the vector space model and the tf-idf weighting formula, the authors propose a term-weighting formula for text representation called p-idf. After comparing four typical text classifiers (Bayes, K-nearest neighbors, neural network, and support vector machine), they build an experimental text categorization system that uses a support vector machine. Experiments comparing the effect of three weighting formulas (tf-idf, p-idf, and LTC) on the performance of the classifier indicate that the p-idf formula is reasonable and effective for text categorization.
Key words: text categorization, vector space model, p-idf, support vector machine (SVM)
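For readers unfamiliar with the two baseline weightings named in the abstract, the following is a minimal sketch of tf-idf and of one common formulation of LTC (log-scaled term frequency, idf, cosine normalization) in the vector space model. The abstract does not give the p-idf formula, so it is not shown. The function names and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def tfidf_vector(tokens, df, n_docs):
    """Classic tf-idf: raw term frequency times log(N / df)."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

def ltc_vector(tokens, df, n_docs):
    """LTC (assumed form): log(tf + 1) * log(N / df), then cosine normalization."""
    tf = Counter(tokens)
    raw = {t: math.log(tf[t] + 1.0) * math.log(n_docs / df[t]) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    # If every term occurs in every document, all weights are zero and so is the norm.
    return {t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()}

# Toy corpus, invented for illustration: each document is a list of tokens.
docs = [
    ["svm", "text", "text", "kernel"],
    ["text", "network", "neuron"],
    ["bayes", "text", "prior"],
]
df = document_frequencies(docs)
tfidf = tfidf_vector(docs[0], df, len(docs))
ltc = ltc_vector(docs[0], df, len(docs))
```

In this toy corpus the term "text" occurs in every document, so both weightings assign it zero weight, while the LTC vector additionally has unit length, which removes the effect of document length on the representation.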
HAO Chun-feng, WANG Zhong-min. Method of expressing features used for large-scale text classification[J]. Computer Engineering and Applications, 2007, 43(15): 170-172.
URL: http://cea.ceaj.org/EN/
http://cea.ceaj.org/EN/Y2007/V43/I15/170