Computer Engineering and Applications ›› 2008, Vol. 44 ›› Issue (20): 162-164. DOI: 10.3778/j.issn.1002-8331.2008.20.049

• Database, Signal and Information Processing •

Model of text representation based on concept

CHEN Long, FAN Rui-xia, GAO Qi

  1. Institute of Pattern Recognition and Intelligent Systems, Beijing Institute of Technology, Beijing 100081, China
  • Received: 2007-09-27 Revised: 2008-01-23 Online: 2008-07-11 Published: 2008-07-11
  • Corresponding author: CHEN Long

Abstract: Text information processing is moving toward semantics, yet the dominant text representation model, the Vector Space Model (VSM), uses individual words as feature terms. This ignores the semantic relations between words, and because synonymy and polysemy are widespread in natural language, it substantially lowers the precision of text information processing. Drawing on techniques and results from natural language processing, this paper introduces concepts and concept distance into the VSM and, working from the semantic and conceptual level, uses concepts rather than words as the feature terms of a text, yielding a concept-based text representation model. Experiments show that the method handles synonymy and polysemy well and improves both recall and precision in text classification.

Key words: text representation model, concept, concept distance
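
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea: words are mapped to concepts, ambiguous words are resolved with a concept distance, and the text is then represented as a vector over concepts. The lexicon, the distance table, and all function names are hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: word -> candidate concept IDs (not from the paper).
LEXICON = {
    "car": ["VEHICLE"],
    "automobile": ["VEHICLE"],          # synonym of "car": same concept
    "engine": ["MACHINE_PART"],
    "bank": ["FINANCE_ORG", "RIVER_SIDE"],  # polysemous word: two candidates
    "loan": ["FINANCE_ACT"],
    "river": ["WATER_BODY"],
}

# Hypothetical symmetric concept distances in [0, 1].
# Pairs that are not listed default to the maximum distance 1.0.
DISTANCE = {
    frozenset(["FINANCE_ORG", "FINANCE_ACT"]): 0.2,
    frozenset(["RIVER_SIDE", "WATER_BODY"]): 0.2,
    frozenset(["VEHICLE", "MACHINE_PART"]): 0.3,
}


def concept_distance(a, b):
    """Distance between two concepts; 0.0 for identical concepts."""
    if a == b:
        return 0.0
    return DISTANCE.get(frozenset([a, b]), 1.0)


def disambiguate(word, context_words):
    """Choose the candidate concept closest to the unambiguous context concepts."""
    candidates = LEXICON.get(word, [word])  # unknown words act as their own concept
    if len(candidates) == 1:
        return candidates[0]
    context = [
        LEXICON[w][0]
        for w in context_words
        if w != word and len(LEXICON.get(w, [])) == 1
    ]
    if not context:
        return candidates[0]  # no evidence: fall back to the first candidate
    return min(candidates, key=lambda c: sum(concept_distance(c, x) for x in context))


def concept_vector(tokens):
    """Represent a tokenized text as concept frequencies instead of word frequencies."""
    return Counter(disambiguate(t, tokens) for t in tokens)


def cosine(u, v):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(u[k] * v[k] for k in u)  # Counter yields 0 for missing keys
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0
```

Under these toy assumptions, `["car", "engine"]` and `["automobile", "engine"]` map to the same concept vector, whereas a word-level comparison would only match on `engine`. Likewise, `bank` resolves to different concepts next to `loan` and next to `river`, so the two texts no longer appear to share a feature.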