Computer Engineering and Applications, 2009, Vol. 45, Issue (22): 129-131. DOI: 10.3778/j.issn.1002-8331.2009.22.042

• Database and Information Processing •

Research on text authorship categorization based on sentence category features

ZHANG Yun-liang1, ZHU Li-jun1, QIAO Xiao-dong1, ZHANG Quan2

  1. Institute of Scientific & Technical Information of China, Beijing 100038, China
  2. Institute of Acoustics, Chinese Academy of Sciences, Beijing 100080, China
  • Received: 2008-10-22 Revised: 2008-11-26 Online: 2009-08-01 Published: 2009-08-01
  • Contact: ZHANG Yun-liang

Research on Classifying Authors' Writing Styles Based on Sentence Category Features (基于句类特征的作者写作风格分类研究)

张运良1,朱礼军1,乔晓东1,张 全2   

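  1. Institute of Scientific and Technical Information of China (中国科学技术信息研究所), Beijing 100038
  2. Institute of Acoustics, Chinese Academy of Sciences (中国科学院声学研究所), Beijing 100080
  • Corresponding author: 张运良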

Abstract: Different authors differ considerably in writing style, and these differences show up in features such as vocabulary, sentence patterns, and rhetoric. This paper adopts sentence category features for text categorization and author recognition. An authorship classifier is built from a sentence category vector space model, sentence category features, dimensionality reduction by decomposing mixed sentence categories, the itc weighting method, the KNN algorithm, and an integrated decision method. The classifier's performance is acceptable and could be improved further with a larger knowledge base, HNC techniques, and machine learning algorithms.

Key words: text classification, authorship, sentence category, Vector Space Model (VSM), Hierarchical Network of Concepts (HNC) theory, natural language processing

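Abstract (Chinese version): Works by different writers have their own characteristics, which appear in vocabulary, sentence patterns, rhetorical devices, and other aspects. This study attempts to classify authors' writing styles using sentence category features, an approach that can be extended to author identification. Using the vector space model with sentence categories as features, the sentence category vector space is reduced in dimension through techniques such as mixed sentence category decomposition. The itc algorithm computes the feature weights, the KNN algorithm performs the classification, and an integrated decision technique is applied on top, yielding a classifier of authors' writing styles. The classifier's performance in classifying and distinguishing early modern and modern novels by author writing style is acceptable, and there is room for further improvement.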

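Key words (Chinese version): text classification, author writing style, sentence category, vector space model, Hierarchical Network of Concepts (HNC) theory, natural language understanding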
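
The pipeline outlined in the abstract (sentence category vectors, itc weighting, KNN, and an integrated decision) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the category labels "A" to "D" are hypothetical stand-ins for HNC sentence categories, the weighting is a log-scaled TF-IDF with cosine normalization (one common description of itc-style weighting; the abstract does not give the formula), the integrated decision is assumed to be a majority vote over text segments, and the dimensionality reduction step is omitted.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Compute log(N / df) for each sentence-category label in the training set."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return {cat: math.log(n_docs / count) for cat, count in df.items()}


def weight(doc, idf):
    """Assumed itc-style weight: log(tf + 1) * idf, then cosine normalization."""
    tf = Counter(doc)
    raw = {cat: math.log(tf[cat] + 1.0) * idf.get(cat, 0.0) for cat in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {cat: w / norm for cat, w in raw.items()} if norm else {}


def cosine(u, v):
    """Dot product of two already-normalized sparse vectors."""
    return sum(w * v.get(cat, 0.0) for cat, w in u.items())


def knn_predict(query, train_vecs, train_labels, k=3):
    """Return the majority author among the k most similar training vectors."""
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def classify_text(segments, idf, train_vecs, train_labels, k=3):
    """Assumed integration rule: majority vote over per-segment KNN decisions."""
    votes = Counter(knn_predict(weight(seg, idf), train_vecs, train_labels, k)
                    for seg in segments)
    return votes.most_common(1)[0][0]


# Hypothetical training data: each document is a sequence of sentence-category labels.
train_docs = [["A", "A", "B"], ["A", "B", "A", "A"],
              ["C", "C", "D"], ["C", "D", "D", "C"]]
train_labels = ["author1", "author1", "author2", "author2"]

idf = fit_idf(train_docs)
train_vecs = [weight(doc, idf) for doc in train_docs]
```

Calling `classify_text` with a list of label sequences, one per segment of an unseen text, returns the author label that wins the vote. Voting per segment is only one plausible reading of "integrated decision"; the paper may combine decisions differently.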