Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (11): 135-141. DOI: 10.3778/j.issn.1002-8331.1902-0119


Characters Segmentation Method of Historical Documents Mixed in Korean and Chinese

LIU Xingchen, JIN Xiaofeng   

  1. Intelligent Information Processing Laboratory, Department of Computer Science & Technology, Yanbian University, Yanji, Jilin 133002, China
  • Online: 2020-06-01 Published: 2020-06-01





To address character segmentation in the digitization of Korean historical documents, this paper proposes a character segmentation algorithm. The algorithm first divides the document into columns using connected-component rules and the projection method, which handles the discontinuous separator lines, skew, and joined (touching) characters found in Korean historical documents. The characters are then segmented by deleting, merging, and splitting connected components. This multi-step procedure exploits the characteristics of the text image, namely varying character sizes and mixed horizontal and vertical arrangement. For touching characters, an improved drop-fall algorithm is adopted. Experimental results show that the proposed algorithm can effectively segment Korean historical documents that contain multiple languages, different character sizes, and complex layouts. On the dataset, column segmentation and character segmentation reach accuracies of 97.69% and 87.79%, respectively.
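
The projection step mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a binarized page in which nonzero pixels are ink, and the function name `split_columns` and the parameters `min_gap` and `min_ink` are hypothetical names chosen for illustration. The connected-component rules the paper uses to cope with broken separator lines, skew, and touching characters are not attempted here.

```python
import numpy as np

def split_columns(binary, min_gap=1, min_ink=0):
    """Return (start, end) x-ranges of text columns in a binary image.

    binary  : 2-D array, nonzero = ink (assumed already binarized).
    min_ink : a pixel column with at most this many ink pixels is a gap.
    min_gap : text runs separated by fewer than this many pixels are merged.
    Ranges are half-open: pixels start .. end-1 belong to the column.
    """
    # Vertical projection: count ink pixels in every pixel column.
    profile = (np.asarray(binary) != 0).sum(axis=0)
    is_text = profile > min_ink

    # Collect maximal runs of consecutive "text" pixel columns.
    ranges = []
    start = None
    for x, on in enumerate(is_text):
        if on and start is None:
            start = x
        elif not on and start is not None:
            ranges.append((start, x))
            start = None
    if start is not None:
        ranges.append((start, len(is_text)))

    # Merge runs whose separating gap is narrower than min_gap.
    merged = []
    for s, e in ranges:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged
```

For a vertically written page, each returned range could then be cropped and passed to a per-column character segmentation step; raising `min_gap` keeps narrow gaps inside a single column from splitting it.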

Key words: ancient books digitalization, Korean historical documents, column segmentation, character segmentation


