Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (11): 137-142.

Text categorization model based on WCBVSM and SACA

ZHANG Yanping1,2, LIU Chao1,2, QU Yonghua3   

  1. Key Lab of Intelligent Computing & Signal Processing, MoE, Anhui University, Hefei 230039, China
  2. School of Computer Science and Technology, Anhui University, Hefei 230039, China
  3. School of Computer Science and Technology, Nanjing Normal University, Nanjing 210046, China
  • Online: 2012-04-11 Published: 2012-04-16

WCBVSM与SACA结合的文本分类模型

张燕平1,2,刘  超1,2,曲永花3   

  1. 安徽大学 计算智能与信号处理教育部重点实验室,合肥 230039
  2. 安徽大学 计算机科学与技术学院,合肥 230039
  3. 南京师范大学 计算机科学与技术学院,南京 210046

Abstract: A new text categorization model is proposed that combines a Word Co-Occurrence Model Based VSM (WCBVSM) with a Cross Cover Algorithm based on Simulated Annealing (SACA). The traditional vector space model (VSM) uses individual terms as the semantic carrier of a document and ignores the implicit semantic relations between words in their context. Inspired by the word co-occurrence model, WCBVSM counts word co-occurrence information in the texts and adds it to the VSM, so that the implicit semantic information of a document is captured. To address the conflict between recognition accuracy and generalization ability in the cross cover algorithm, SACA applies the idea of simulated annealing: it removes the randomness in choosing the initial center of each cover, and it reduces the number of covers by increasing the number of samples contained in each cover, which strengthens the generalization ability of the cover classifier. Experimental results show that the proposed model speeds up recognition and also improves classification accuracy.

Key words: text categorization, vector space model, word co-occurrence model, simulated annealing algorithm, cross cover algorithm
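
The idea of adding word co-occurrence information to a VSM can be illustrated with a minimal Python sketch. It is not the authors' WCBVSM: the abstract does not specify how co-occurrence is counted or weighted, so the sliding window, the `min_pair_count` threshold, the raw-count weighting, and the function name `cooccurrence_vsm` are all assumptions made for illustration.

```python
from collections import Counter


def cooccurrence_vsm(docs, window=1, min_pair_count=2):
    """Build term-count vectors extended with word-pair co-occurrence counts.

    Assumptions (not taken from the paper): whitespace tokenization,
    co-occurrence means two different words within `window` tokens of each
    other, and only pairs seen at least `min_pair_count` times in the whole
    corpus become features.
    """
    tokenized = [doc.lower().split() for doc in docs]

    def pairs_in(tokens):
        # Count unordered pairs of distinct words that fall inside the window.
        found = Counter()
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if v != w:
                    found[tuple(sorted((w, v)))] += 1
        return found

    doc_pairs = [pairs_in(tokens) for tokens in tokenized]

    # Corpus-wide pair counts decide which pairs are kept as features.
    corpus_pairs = Counter()
    for pairs in doc_pairs:
        corpus_pairs.update(pairs)

    vocab = sorted({w for tokens in tokenized for w in tokens})
    kept_pairs = sorted(p for p, c in corpus_pairs.items() if c >= min_pair_count)
    features = vocab + ["+".join(p) for p in kept_pairs]

    # Each vector is the plain term counts followed by the pair counts.
    vectors = []
    for tokens, pairs in zip(tokenized, doc_pairs):
        tf = Counter(tokens)
        vectors.append([tf[w] for w in vocab] + [pairs[p] for p in kept_pairs])
    return features, vectors


if __name__ == "__main__":
    docs = [
        "simulated annealing algorithm",
        "simulated annealing cover",
        "cover algorithm",
    ]
    features, vectors = cooccurrence_vsm(docs)
    for name_row in (features, *vectors):
        print(name_row)
```

In this sketch the pair "annealing+simulated" becomes an extra dimension because it recurs across documents, so two texts that share the phrase end up closer than their single-term counts alone would suggest. A real system would likely replace the raw counts with a weighting scheme such as TF-IDF.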

摘要: 给出了一个词共现改进的向量空间模型(Word Co-Occurrence Model Based On VSM,WCBVSM)与模拟退火交叉覆盖算法(Cross Cover Algorithm Based On Simulated Annealing Algorithm,SACA)相结合的文本分类新模型。传统的向量空间模型(VSM)采用词条作为文档的语义载体,没有考虑文本上下文词语之间的语义隐含信息,在词共现模型的启发下,提出WCBVSM,它通过统计文本中的词共现信息,加入VSM,以获得文档隐含的语义信息。针对交叉覆盖算法中识别精度与泛化能力之间的一对矛盾,结合模拟退火算法的思想,提出了SACA,改进了传统交叉覆盖在覆盖初始点选取时的随机性,并通过增加每个覆盖所包含的样本点来减少覆盖数,从而增强了覆盖的泛化能力。实验结果表明提出的文本分类新模型在加快识别速度的基础上,提高了分类的精度。

关键词: 文本分类, 向量空间模型, 词共现模型, 模拟退火, 交叉覆盖算法