Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (10): 127-133. DOI: 10.3778/j.issn.1002-8331.1901-0195


Research on Filtering Algorithm for Sensitive Information in Multi-form Uyghur

Yibulayin·Wusiman, GUO Wenqiang, YU Kai   

  1. School of Computer Science and Engineering, Xinjiang University of Finance and Economics, Urumqi 830012, China
  • Online: 2020-05-15 Published: 2020-05-13

面向多形式维文的敏感信息过滤算法研究

依不拉音·吾斯曼,郭文强,于凯   

  1. 新疆财经大学 计算机科学与工程学院,乌鲁木齐 830012

Abstract:

Existing research on detecting and filtering sensitive information in Uyghur text is limited to traditional Uyghur. Uyghur on the Internet, however, now exhibits a “one language, two scripts” pattern in which traditional Uyghur and Latin Uyghur coexist. Studying a sensitive information filtering algorithm for multi-form Uyghur therefore has practical significance for network security and social stability in Xinjiang and for the overall goal of lasting stability. The Unicode encoding characteristics of Latin Uyghur and traditional Uyghur are studied, and a code conversion algorithm between the two, ULTC (Uyghur Latin Traditional Conversion), is proposed. Using ULTC, Latin Uyghur sensitive information corpora are added to the existing traditional Uyghur sensitive information corpus to construct a multi-form Uyghur sensitive information corpus, ULSC (Uyghur Latin Sensitive Corpus). Based on ULSC, a method for calculating multi-form Uyghur sensitivity values is proposed, and a multi-form Uyghur sensitive information decision tree, LUDT (Latin Uyghur Decision Tree), that integrates traditional Uyghur and Latin Uyghur is constructed. Based on LUDT, the multi-form Uyghur Sensitive Information Filtering (USF) algorithm is proposed. Experimental results show that the USF algorithm achieves a high recall rate.

Key words: traditional Uyghur, Latin Uyghur, sensitive information, decision tree
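
The pipeline summarized in the abstract (script conversion, a term tree covering both scripts, and a sensitivity value) can be illustrated with a minimal Python sketch. It is not the ULTC, LUDT, or USF algorithm of the paper, whose details the abstract does not give. Everything in it is an assumption made for illustration: the partial letter table `LATIN_TO_TRAD`, the function names, the use of a simple prefix tree in place of the decision tree, whole-token matching, and additive term weights. The conversion is also simplified: it ignores the word-initial hamza of traditional Uyghur and ambiguous letter pairs. The neutral words "kitab" and "mektep" stand in for corpus entries.

```python
# Hypothetical, partial Latin-to-traditional letter table (an assumption for
# illustration; not the paper's ULTC rules). Digraphs are listed first.
LATIN_TO_TRAD = {
    "ch": "\u0686", "sh": "\u0634", "gh": "\u063a", "ng": "\u06ad", "zh": "\u0698",
    "a": "\u0627", "e": "\u06d5", "b": "\u0628", "p": "\u067e", "t": "\u062a",
    "j": "\u062c", "x": "\u062e", "d": "\u062f", "r": "\u0631", "z": "\u0632",
    "s": "\u0633", "f": "\u0641", "q": "\u0642", "k": "\u0643", "g": "\u06af",
    "l": "\u0644", "m": "\u0645", "n": "\u0646", "h": "\u06be", "o": "\u0648",
    "u": "\u06c7", "\u00f6": "\u06c6", "\u00fc": "\u06c8", "w": "\u06cb",
    "\u00e9": "\u06d0", "i": "\u0649", "y": "\u064a",
}


def latin_to_traditional(word):
    """Greedy left-to-right conversion; digraphs are tried before single letters.

    Characters without a table entry (including traditional-script letters)
    pass through unchanged, so the function is safe to call on either script.
    """
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in LATIN_TO_TRAD:
            out.append(LATIN_TO_TRAD[pair])
            i += 2
        else:
            out.append(LATIN_TO_TRAD.get(word[i], word[i]))
            i += 1
    return "".join(out)


def build_tree(weighted_terms):
    """Build a prefix tree over terms normalized to the traditional script."""
    root = {}
    for term, weight in weighted_terms.items():
        node = root
        for ch in latin_to_traditional(term):
            node = node.setdefault(ch, {})
        node["$weight"] = weight  # multi-character key, never clashes with a letter
    return root


def sensitivity_score(text, tree):
    """Sum the weights of whitespace-separated tokens that match a stored term."""
    total = 0.0
    for token in text.split():
        node = tree
        for ch in latin_to_traditional(token):
            node = node.get(ch)
            if node is None:
                break
        else:  # the whole token was walked without falling off the tree
            total += node.get("$weight", 0.0)
    return total


def is_sensitive(text, tree, threshold):
    """Flag text whose accumulated score reaches the (assumed) threshold."""
    return sensitivity_score(text, tree) >= threshold
```

Because every term and every input token is normalized to one script before lookup, a single tree serves both written forms, which is the property the abstract attributes to an integrated structure. For example, a tree built from `{"kitab": 1.0}` matches both the Latin token "kitab" and its traditional-script spelling.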

摘要:

现有的维文敏感信息检测与过滤研究只限于传统维文,而现在互联网上的维文使用呈现传统维文和拉丁维文共存的“一语双文”特点,因此,研究多形式维文的敏感信息过滤算法对新疆的网络安全及社会稳定和长治久安总目标的实现有重要的实际意义。研究拉丁维文和传统维文的Unicode编码特征,提出它们间的编码转换算法ULTC(Uyghur Latin Traditional Conversion),通过该算法在已有的语料库中添加拉丁维文敏感信息语料,从而构建多形式维文敏感信息语料库ULSC(Uyghur Latin Sensitive Corpus);在语料库的基础上构建传统维文和拉丁维文一体化的多形式维文敏感信息决策树LUDT(Latin Uyghur Decision Tree),在决策树的基础上提出多形式维文敏感信息过滤算法USF(Uyghur Sensitive Information Filter)。实验结果表明,USF算法具有较高的查全率。

关键词: 传统维文, 拉丁维文, 敏感信息, 决策树