Word Segmentation of Uyghur Image Based on Clustering and Conjoined Segment Identification

doi:10.3778/j.issn.1002-8331.1905-0032

Computer Engineering and Applications, 2020, Vol. 56, Issue (14): 148-155.

XU Xuebin, Hornisa Mamat, Alim Aysa, ZHU Yali, Kurban Ubul

1. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
2. Library, Xinjiang University, Urumqi 830046, China
3. Department of Teacher Affairs, Xinjiang University, Urumqi 830046, China

Online: 2020-07-15 Published: 2020-07-14

聚类+连体段判别的维吾尔文档图像单词切分

徐学斌，吾尔尼沙·买买提，阿力木江·艾沙，朱亚俐，库尔班·吾布力

1.新疆大学信息科学与工程学院，乌鲁木齐 830046
2.新疆大学图书馆，乌鲁木齐 830046
3.新疆大学教师工作部，乌鲁木齐 830046

Abstract

Current research on segmenting printed Uyghur document images focuses mainly on letter segmentation, and few studies address word segmentation. Existing methods also handle punctuation marks poorly, do not merge words that are written in split form, and leave room for improvement in word segmentation accuracy. In this paper, the document image is first projected, and the gaps between all conjoined segments in each text line are then clustered with the K-means algorithm to obtain the optimal gap discrimination threshold. All conjoined segments are then screened and roughly identified, and this result is combined with the threshold-based classification of the gaps to determine the exact word segmentation points and to obtain the positions of words written in split form. In experiments on 100 selected document images, the proposed method effectively removes the influence of punctuation on the segmentation results, accurately merges words written in split form, and keeps the average word segmentation accuracy above 99%.

Key words: Uyghur script, document image, word segmentation, K-means, conjoined segment identification, word splitting
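
The gap-clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a vertical projection profile of one text line (ink-pixel count per column), treats each run of non-empty columns as a conjoined segment, clusters the gap widths with one-dimensional K-means using k = 2, and takes the midpoint between the two cluster centres as the threshold. The value of k, the midpoint rule, and the function names `column_runs`, `gap_widths`, and `two_means_threshold` are all assumptions made for illustration. The screening of conjoined segments (punctuation handling and merging of split words) is not covered.

```python
from typing import List, Tuple


def column_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return [start, end) column ranges where the projection is non-zero.

    Each range stands in for one conjoined segment (hypothetical helper).
    """
    runs = []
    start = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                      # a segment begins
        elif count == 0 and start is not None:
            runs.append((start, x))        # a segment ends
            start = None
    if start is not None:                  # segment touching the right edge
        runs.append((start, len(profile)))
    return runs


def gap_widths(runs: List[Tuple[int, int]]) -> List[int]:
    """Width of the blank gap between each pair of neighbouring segments."""
    return [nxt[0] - cur[1] for cur, nxt in zip(runs, runs[1:])]


def two_means_threshold(gaps: List[int], iterations: int = 100) -> float:
    """1-D K-means with k = 2 (assumed); returns the midpoint of the centres."""
    if not gaps:
        raise ValueError("at least one gap is required")
    lo, hi = float(min(gaps)), float(max(gaps))
    if lo == hi:
        return lo                          # all gaps equal: nothing to separate
    for _ in range(iterations):
        small = [g for g in gaps if abs(g - lo) <= abs(g - hi)]
        large = [g for g in gaps if abs(g - lo) > abs(g - hi)]
        new_lo = sum(small) / len(small)
        new_hi = sum(large) / len(large)
        if new_lo == lo and new_hi == hi:  # centres stopped moving
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


# Toy projection profile of one text line (made-up values).
profile = [0, 3, 3, 0, 2, 2, 0, 0, 0, 0, 0, 4, 4, 0, 1, 0, 0, 0, 0, 0, 0, 5]
runs = column_runs(profile)
gaps = gap_widths(runs)
threshold = two_means_threshold(gaps)
# Gaps wider than the threshold are candidate word boundaries.
boundaries = [i for i, g in enumerate(gaps) if g > threshold]
```

In this toy profile the narrow gaps fall into one cluster and the wide gaps into the other, so only the wide gaps are kept as candidate word boundaries. A real pipeline would still need the segment identification step before committing to those boundaries.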

摘要：

目前针对印刷体维吾尔文档图像的切分研究主要集中在字母切分上，单词切分的文献较少，且存在着标点符号难处理，未合并被拆分书写的单词等问题，同时单词切分准确率有待进一步提高。在对文档图像进行投影处理的基础上，通过K均值聚类算法（K-means）对文本行中所有连体段之间的间隙进行聚类分析得出最佳的间隙判别阈值，然后对所有连体段进行筛选和粗略识别，并结合对间隙的阈值判别结果来确定单词的精确切分点和获取被拆分书写单词的位置信息。在选取的100张文档图像中测试时，结果表明该方法能有效去除标点符号对切分结果的影响，准确合并被拆分书写的单词，并且平均单词切分准确率保持在99%以上。

关键词: 维吾尔文, 文档图像, 单词切分, K-means, 连体段判别, 单词拆分

XU Xuebin, Hornisa Mamat, Alim Aysa, ZHU Yali, Kurban Ubul. Word Segmentation of Uyghur Image Based on Clustering and Conjoined Segment Identification[J]. Computer Engineering and Applications, 2020, 56(14): 148-155.

徐学斌，吾尔尼沙·买买提，阿力木江·艾沙，朱亚俐，库尔班·吾布力. 聚类+连体段判别的维吾尔文档图像单词切分[J]. 计算机工程与应用, 2020, 56(14): 148-155.
