Bigram feature extraction for Uyghur text

Computer Engineering and Applications, 2015, Vol. 51, Issue (3): 216-221.

Alimjan AYSA1,3, Kurban UBUL2,3, Turgun IBRAHIM2,3

1. Network and Information Technology Center, Xinjiang University, Urumqi 830046, China
2. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
3. Xinjiang Laboratory of Multi-language Information Technology, Urumqi 830046, China

Publication date: 2015-02-01 Release date: 2015-01-28

Abstract

Text feature representation is one of the most important steps in automatic text categorization. In text representation based on the vector space model (VSM), the choice of feature granularity directly affects categorization performance. In Uyghur text categorization, single-word features do not represent text content well. To address this problem, the role of Uyghur Bigrams in text categorization is analyzed, a new statistic, CHIMI, is constructed, and a Uyghur Bigram feature extraction algorithm is proposed on that basis. The extracted Bigrams are then used as text features in Uyghur text categorization experiments with a support vector machine (SVM). The experimental results show that, compared with word-based categorization, Bigram features improve the precision and recall of Uyghur text categorization, which demonstrates the effectiveness of the proposed algorithm.

Key words: Bigram text feature, χ² statistic, mutual information, Uyghur language
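The abstract names the χ² statistic and mutual information as the ingredients of CHIMI but does not give the formula that combines them. As a minimal sketch only, the two standard measures can be computed for adjacent word pairs as shown here. It assumes both are used as word-association scores over a 2×2 contingency table, which is one common reading and not necessarily the paper's definition. The function name `bigram_scores` and the placeholder tokens are hypothetical.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {(w1, w2): (chi2, pmi)} for every adjacent word pair.

    Illustrative only: standard chi-square and pointwise mutual
    information for word association. This is NOT the paper's CHIMI
    statistic, whose definition is not given in the abstract.
    """
    pairs = list(zip(tokens, tokens[1:]))
    n = len(pairs)  # total number of adjacent pairs
    pair_counts = Counter(pairs)
    first_counts = Counter(w1 for w1, _ in pairs)   # w1 in first position
    second_counts = Counter(w2 for _, w2 in pairs)  # w2 in second position

    scores = {}
    for (w1, w2), o11 in pair_counts.items():
        c1 = first_counts[w1]
        c2 = second_counts[w2]
        # Remaining cells of the 2x2 contingency table.
        o12 = c1 - o11              # w1 followed by something other than w2
        o21 = c2 - o11              # w2 preceded by something other than w1
        o22 = n - o11 - o12 - o21   # neither w1 first nor w2 second
        denom = c1 * c2 * (n - c1) * (n - c2)
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
        pmi = math.log2(o11 * n / (c1 * c2))
        scores[(w1, w2)] = (chi2, pmi)
    return scores


# Placeholder tokens; a real run would use segmented Uyghur words.
scores = bigram_scores(["a", "b", "a", "b", "c", "d"])
chi2_ab, pmi_ab = scores[("a", "b")]
```

Candidate Bigrams could then be ranked by either score, or by some combination of the two, before being passed to a classifier as features. How the paper actually combines them is left open here.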

Alimjan AYSA, Kurban UBUL, Turgun IBRAHIM. Bigram feature extraction for Uyghur text[J]. Computer Engineering and Applications, 2015, 51(3): 216-221.
