Computer Engineering and Applications ›› 2007, Vol. 43 ›› Issue (10): 183-185.

• Database and Information Processing •

A New Model for Text Feature Selection Based on Twin Words Relationship

MaoTing Gao, ZhengOu Wang

  1. Shanghai Maritime University; Institute of Systems Engineering, Tianjin University
  • Received: 2006-08-14 Revised: 1900-01-01 Online: 2007-04-01 Published: 2007-04-01
  • Corresponding author: MaoTing Gao

Abstract: The Vector Space Model (VSM) is a commonly used method for representing text features in text mining. It is built on the assumption that features are independent: a text is treated as a collection of separate words that have no association with one another, so important information about the relationships between words is lost. The twin words relationship based text feature selection model extends VSM by also selecting the association between adjacent words in the text as text features, so that the feature information of the text is expressed more fully. Experiments show that this model is a more effective way to select text features.

Keywords: Feature Selection, Twin Words Relationship, Clustering Analysis, Text Mining
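
The idea summarized in the abstract, keeping the single-word features of VSM and adding features for adjacent word pairs, can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenizer, the raw-count weighting, the space-joined pair naming, and the function names `tokenize`, `twin_word_features`, and `build_vectors` are all assumptions made for illustration, and the abstract does not state how pairs are weighted or filtered.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for
    # whatever word segmentation the original work used.
    return text.lower().split()


def twin_word_features(tokens):
    """Count single words plus adjacent word pairs in one document."""
    features = Counter(tokens)  # single-word terms, as in plain VSM
    for left, right in zip(tokens, tokens[1:]):
        # Hypothetical naming: a pair is stored as "left right".
        # Tokens contain no spaces, so a pair never collides with a word.
        features[f"{left} {right}"] += 1
    return features


def build_vectors(docs):
    """Map each document to a count vector over a shared feature vocabulary."""
    counted = [twin_word_features(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counted))
    vectors = [[c[term] for term in vocab] for c in counted]
    return vocab, vectors


docs = ["text mining finds patterns", "mining text data"]
vocab, vectors = build_vectors(docs)
```

Both sample documents contain the words "text" and "mining", so single-word features alone cannot tell their word order apart. The pair features "text mining" and "mining text" can, which is the kind of adjacency information the abstract says plain VSM discards.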