Computer Engineering and Applications, 2012, Vol. 48, Issue (30): 28-32.

• Doctoral Forum •

Research on and application of the natural rhythm of language in text classification

CHEN Fan1,2, FENG Zhiyong1

  1. School of Computer Science and Technology, Tianjin University, Tianjin 300072, China
  2. Department of Information Science and Technology, School of Science and Technology, Tianjin University of Finance and Economics, Tianjin 300200, China
  • Online: 2012-10-21  Published: 2012-10-22

Abstract: Large-scale document classification is a very complex task. This paper proposes a text classification method based on the natural rhythm of language. The method analyzes the natural rhythm marked by punctuation in a text, extracts features from it, and applies a Bayesian classifier, which allows text classification to be completed quickly and efficiently. Unlike the current mainstream methods based on term features, it does not need to understand or analyze semantics; that is, it does not analyze the words in a document. Its feature space is small, data sparsity is not pronounced, and the authors report a marked classification effect.

Key words: text classification, punctuation, natural rhythm of language, state transition
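
The abstract does not spell out which rhythm features are extracted, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that the features are transitions between consecutive punctuation marks (suggested by the key word "state transition") plus a coarse bucket for the number of characters between marks, and it pairs them with a multinomial naive Bayes classifier with Laplace smoothing. The names `PUNCT`, `rhythm_features`, and `NaiveBayes`, the bucket thresholds, and the toy training texts are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Assumption: a small hand-picked set of ASCII and full-width punctuation marks.
PUNCT = set(",.;:!?，。；：！？、")


def rhythm_features(text):
    """Count punctuation transitions and gap-length buckets; words are never inspected."""
    feats = Counter()
    prev = "^"  # hypothetical start-of-text state
    gap = 0     # non-space characters seen since the previous mark
    for ch in text:
        if ch in PUNCT:
            # Assumed thresholds for short / medium / long gaps.
            bucket = "short" if gap <= 5 else "medium" if gap <= 15 else "long"
            feats[("trans", prev, ch)] += 1
            feats[("gap", ch, bucket)] += 1
            prev, gap = ch, 0
        elif not ch.isspace():
            gap += 1
    return feats


class NaiveBayes:
    """Multinomial naive Bayes over rhythm features, with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_docs = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            feats = rhythm_features(text)
            self.class_docs[label] += 1
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        feats = rhythm_features(text)
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.feat_counts[label]
            # One extra slot reserves probability mass for unseen features.
            denom = sum(counts.values()) + self.alpha * (len(self.vocab) + 1)
            score = math.log(n_docs / total_docs)
            for feat, n in feats.items():
                score += n * math.log((counts[feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    train_texts = [
        "Hi! You? Yes! No? Go! Why? Me! Who?",
        "The committee reviewed the proposal carefully, and it decided to "
        "postpone the vote. The members agreed that more data was needed, "
        "so they scheduled another meeting.",
    ]
    clf = NaiveBayes().fit(train_texts, ["dialogue", "prose"])
    print(clf.predict("Wait! What? Stop!"))  # prints "dialogue"
```

Because the feature set is bounded by the number of punctuation marks and gap buckets rather than by the vocabulary, the feature space in this sketch stays small, which is in the spirit of the small feature space and weak sparsity claimed in the abstract.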