Characters Segmentation Method of Historical Documents Mixed in Korean and Chinese

doi:10.3778/j.issn.1002-8331.1902-0119

Abstract:

To address character segmentation in the digitization of Korean historical documents, this paper proposes a character segmentation algorithm. The algorithm first splits the document into columns using connected-component rules combined with a projection method, which handles the discontinuous separator lines, skew, and touching characters found in these documents. Characters are then segmented by deleting, merging, and splitting connected components. A multi-step procedure exploits the varying character sizes and the mixed horizontal and vertical arrangement of the text image to complete the segmentation. For touching characters, an improved drop-fall algorithm is used to separate them. Experimental results show that the proposed algorithm can effectively segment Korean historical documents that mix Korean and Chinese text, contain characters of different sizes, and have complex layouts. On the dataset, column segmentation and character segmentation reach accuracies of 97.69% and 87.79%, respectively.
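As an illustration of the projection idea behind column splitting, the following is a minimal Python sketch, not the authors' implementation. It assumes a binarized page stored as a list of rows with 1 for ink and 0 for background. The names `vertical_projection` and `split_columns` and the thresholds `min_ink` and `min_gap` are hypothetical. The sketch covers only a plain vertical projection profile; the connected-component rules that the paper combines with projection to cope with skew and touching characters are not modeled here.

```python
from typing import List, Tuple


def vertical_projection(image: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in each pixel column of a binary image."""
    if not image:
        return []
    width = len(image[0])
    return [sum(row[x] for row in image) for x in range(width)]


def split_columns(
    image: List[List[int]], min_ink: int = 1, min_gap: int = 2
) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges of text columns, with end exclusive.

    A pixel column counts as text when its ink count is >= min_ink.
    Gaps narrower than min_gap are bridged so that a small break
    does not split one text column in two (illustrative thresholds).
    """
    profile = vertical_projection(image)

    # Collect maximal runs of pixel columns that contain enough ink.
    runs: List[Tuple[int, int]] = []
    start = None
    for x, count in enumerate(profile):
        if count >= min_ink and start is None:
            start = x
        elif count < min_ink and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    # Merge runs separated by gaps narrower than min_gap.
    merged: List[Tuple[int, int]] = []
    for s, e in runs:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


# Toy page: three identical rows, ink at x = 0, 1, 3, 7, 8.
page = [[1, 1, 0, 1, 0, 0, 0, 1, 1, 0] for _ in range(3)]
columns = split_columns(page)
```

On the toy page, the one-pixel gap at x = 2 is bridged while the three-pixel gap at x = 4 to 6 separates two columns. A plain profile like this breaks down when columns are skewed or characters touch across columns, which is the situation the connected-component rules in the abstract are meant to handle.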

Keywords: digitization of ancient books, Korean historical documents, column segmentation, character segmentation


LIU Xingchen, JIN Xiaofeng. Characters Segmentation Method of Historical Documents Mixed in Korean and Chinese[J]. Computer Engineering and Applications, 2020, 56(11): 135-141.

刘星辰，金小峰. 朝汉混排古籍的文字切分方法[J]. 计算机工程与应用, 2020, 56(11): 135-141.

