Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (22): 86-92.


N-grams feature selection and weighting algorithm based on single-word matrix intersection

QIU Yunfei1, LIU Shixing1, SHAO Liangshan2   

  1. School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
  2. System Engineering Institute, Liaoning Technical University, Huludao, Liaoning 125105, China
  • Online:2016-11-15 Published:2016-12-02

基于字矩阵交运算的n-grams特征选择加权算法

邱云飞1,刘世兴1,邵良杉2   

  1. 辽宁工程技术大学 软件学院,辽宁 葫芦岛 125105
  2. 辽宁工程技术大学 系统工程研究所,辽宁 葫芦岛 125105

Abstract: For Chinese text, traditional n-grams feature selection and weighting algorithms (such as the sliding window method) have two shortcomings. First, a word segmentation interface must be called on every text before words can be combined into n-grams features. Second, redundant words inside n-grams cannot be removed, so redundant n-grams features interfere with useful ones and lower classification accuracy. To address these problems, the text is transformed into a single-word matrix according to the theory of Chinese single-character and double-character word recognition. N-grams features are then obtained by redundancy filtering and intersection operations on the elements of the single-word matrix, which keeps redundant words out of the n-grams features and requires no call to any word segmentation interface. Experimental results on the Sogou Chinese news corpus and the NetEase text corpus show that, compared with the sliding window method and other n-grams feature selection and weighting algorithms, the n-grams feature selection and weighting algorithm based on single-word matrix intersection takes less time to produce n-grams features and achieves better classification results with a Support Vector Machine (SVM).

Key words: Chinese single and double word recognition, single-word matrix, intersection, feature selection, feature weighting

摘要: 中文文本中,传统的n-grams特征选择加权算法(如滑动窗口法等)存在两点不足:在将每个词进行组合、生成n-grams特征之前必须对每篇文本调用分词接口。无法删除n-grams中的冗余词,使得冗余的n-grams特征对其他有用的n-grams特征产生干扰,降低分类准确率。为解决以上问题,根据汉语单、双字词识别研究理论,将文本转化为字矩阵。通过对字矩阵中元素进行冗余过滤和交运算得到n-grams特征,避免了n-grams特征中存在冗余词的情况,且不需对文本调用任何分词接口。在搜狗中文新闻语料库和网易文本语料库中的实验结果表明,相比于滑动窗口法和其他n-grams特征选择加权算法,基于字矩阵交运算的n-grams特征选择加权算法得到的n-grams特征耗时更短,在支持向量机(Support Vector Machine,SVM)中的分类效果更好。

关键词: 汉语单双字识别, 字矩阵, 交运算, 特征选择, 特征加权
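
For readers unfamiliar with the sliding window baseline that the abstract compares against, the following is a minimal Python sketch of it, not the authors' implementation and not the single-word matrix algorithm itself. It assumes the text has already been split into words by some external segmenter; the function names `sliding_window_ngrams` and `ngram_counts` and the sample token list are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import List, Tuple


def sliding_window_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive tokens, in order of appearance."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A window of width n slides one token at a time across the sequence.
    # If the sequence is shorter than n, the range is empty and so is the result.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens: List[str], max_n: int) -> Counter:
    """Count all n-grams for n = 1 .. max_n (raw frequency, no weighting)."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        counts.update(sliding_window_ngrams(tokens, n))
    return counts


if __name__ == "__main__":
    # Hypothetical, already-segmented sentence (segmentation is assumed here;
    # the baseline cannot run without it).
    tokens = ["我们", "的", "系统", "很", "快"]
    for gram in sliding_window_ngrams(tokens, 2):
        print(gram)
```

In this sketch the function word "的" appears in two of the four bigrams, which illustrates both shortcomings named in the abstract: the baseline depends on prior word segmentation, and it has no step that removes redundant words from the n-grams it emits.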