Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (14): 148-155. DOI: 10.3778/j.issn.1002-8331.1905-0032

• Pattern Recognition and Artificial Intelligence •

Word Segmentation of Uyghur Image Based on Clustering and Conjoined Segment Identification

XU Xuebin (徐学斌), Hornisa Mamat (吾尔尼沙·买买提), Alim Aysa (阿力木江·艾沙), ZHU Yali (朱亚俐), Kurban Ubul (库尔班·吾布力)

  1. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
  2. Library, Xinjiang University, Urumqi 830046, China
  3. Department of Teacher Affairs, Xinjiang University, Urumqi 830046, China
  • Online: 2020-07-15 Published: 2020-07-14

Abstract:

Research on segmenting printed Uyghur document images has focused mainly on letter segmentation. Few studies address word segmentation, and those that do handle punctuation marks poorly, do not merge words that are written in split form, and leave room for improvement in word segmentation accuracy. Building on projection processing of the document image, the proposed method applies the K-means clustering algorithm to the gaps between all conjoined segments in a text line to obtain the best gap discrimination threshold. All conjoined segments are then screened and roughly identified, and the results are combined with the threshold decision on each gap to determine the exact word segmentation points and to obtain the positions of words written in split form. In tests on 100 selected document images, the method effectively removes the influence of punctuation marks on the segmentation results, accurately merges words written in split form, and keeps the average word segmentation accuracy above 99%.

Key words: Uyghur script, document image, word segmentation, K-means, conjoined segment identification, word splitting
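
The gap-clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes two clusters (small gaps inside a word, large gaps between words) and takes the midpoint of the two cluster centres as the threshold, neither of which the abstract specifies. The function names `gaps_from_projection` and `kmeans_gap_threshold` and the sample data are hypothetical. The later screening of conjoined segments (punctuation handling, merging of split words) is omitted.

```python
def gaps_from_projection(column_ink):
    """Widths of blank runs lying between inked runs of a vertical projection.

    column_ink[i] is the number of ink pixels in column i of one text line.
    Leading and trailing blank margins are ignored.
    """
    gaps, run, seen_ink = [], 0, False
    for count in column_ink:
        if count > 0:
            if seen_ink and run > 0:
                gaps.append(run)      # a blank run just ended between two segments
            seen_ink, run = True, 0
        elif seen_ink:
            run += 1                  # extend the current blank run
    return gaps


def kmeans_gap_threshold(gaps, max_iter=100):
    """1-D K-means with k = 2 (an assumption); returns the midpoint of the centres.

    Returns None when all gaps are equal, since no two clusters exist.
    """
    lo, hi = float(min(gaps)), float(max(gaps))   # initial cluster centres
    if lo == hi:
        return None
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        # With two centres in 1-D, nearest-centre assignment is a split at the midpoint.
        small = [g for g in gaps if g <= mid]
        large = [g for g in gaps if g > mid]
        new_lo = sum(small) / len(small)
        new_hi = sum(large) / len(large)
        if new_lo == lo and new_hi == hi:
            break                                 # centres stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0


# Hypothetical gap widths (pixels) between consecutive conjoined segments in one line.
sample_gaps = [2, 3, 2, 3, 2, 12, 3, 11, 2, 13]
threshold = kmeans_gap_threshold(sample_gaps)
# Gaps wider than the threshold are treated as candidate word boundaries.
candidate_boundaries = [i for i, g in enumerate(sample_gaps) if g > threshold]
```

In this sketch a gap wider than `threshold` is only a candidate word boundary. According to the abstract, the final segmentation points also depend on the rough identification of each conjoined segment.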