Computer Engineering and Applications ›› 2009, Vol. 45 ›› Issue (17): 129-132. DOI: 10.3778/j.issn.1002-8331.2009.17.039

• Database and Information Processing •

Efficient method of automatically obtaining features

MA Chun-hua1, ZHU Hao-dong2,3

  1. Computer Science and Technology Department, Suihua College, Suihua, Heilongjiang 152061, China
  2. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu 610041, China
  3. The Graduate School of the Chinese Academy of Sciences, Beijing 100039, China
  • Received: 2009-02-23 Revised: 2009-04-03 Online: 2009-06-11 Published: 2009-06-11
  • Contact: MA Chun-hua

Abstract: At present, the domain features in many knowledge bases are obtained mainly by hand by experts. This is time-consuming and laborious, and the quality of the knowledge base is limited by the experts' knowledge and experience. Some methods for automatically obtaining domain features exist, but the feature sets they extract are mostly redundant. How to automatically obtain a representative domain feature set with low redundancy is therefore an urgent problem. This paper first analyzes the shortcomings of several methods for automatically obtaining domain terms and improves on them. It then uses the improved method to automatically obtain domain terms from a large-scale classified corpus. Finally, it applies rough set theory to perform attribute reduction on the resulting set of domain terms, yielding a domain feature set with low redundancy and good representativeness. Experiments verify the effectiveness and practicality of the proposed method.

Key words: knowledge base, feature extraction, rough set, attribute reduction
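
The abstract names rough-set attribute reduction as the final step but does not say which reduction algorithm is used. As an illustration only, the sketch below shows one common generic form: compute the dependency degree of the class label on a set of attributes (the fraction of documents whose class is uniquely determined by their attribute values), then greedily drop any attribute whose removal leaves that degree unchanged. The function names, the toy decision table, and the word and class names are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def dependency(rows, labels, attrs):
    """Fraction of rows whose label is uniquely determined by their values on attrs."""
    classes_seen = defaultdict(set)   # attribute-value tuple -> labels observed
    sizes = defaultdict(int)          # attribute-value tuple -> number of rows
    for row, label in zip(rows, labels):
        key = tuple(row[a] for a in attrs)
        classes_seen[key].add(label)
        sizes[key] += 1
    # Rows in groups with a single label form the positive region.
    consistent = sum(n for key, n in sizes.items() if len(classes_seen[key]) == 1)
    return consistent / len(rows)

def reduce_attributes(rows, labels, attrs):
    """Greedy backward elimination: drop an attribute if the dependency degree is unchanged."""
    full = dependency(rows, labels, attrs)
    kept = list(attrs)
    for a in list(attrs):
        trial = [b for b in kept if b != a]
        if dependency(rows, labels, trial) == full:
            kept = trial
    return kept

# Hypothetical toy decision table: 1 means the word occurs in the document.
rows = [
    {"goal": 1, "match": 1, "stock": 0},
    {"goal": 1, "match": 1, "stock": 0},
    {"goal": 0, "match": 0, "stock": 1},
    {"goal": 0, "match": 1, "stock": 1},
]
labels = ["sports", "sports", "finance", "finance"]

reduct = reduce_attributes(rows, labels, ["goal", "match", "stock"])
print(reduct)  # prints ['stock']
```

The result of this greedy pass depends on the order in which attributes are tried: here "goal" alone would separate the two classes equally well, so a different order could return it instead. Heuristics that rank attributes before elimination are a common refinement; whether the paper uses one is not stated in the abstract.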