Computer Engineering and Applications (计算机工程与应用) ›› 2020, Vol. 56 ›› Issue (11): 135-141. DOI: 10.3778/j.issn.1002-8331.1902-0119

• Pattern Recognition and Artificial Intelligence •

Characters Segmentation Method of Historical Documents Mixed in Korean and Chinese (朝汉混排古籍的文字切分方法)

LIU Xingchen (刘星辰), JIN Xiaofeng (金小峰)

  1. Intelligent Information Processing Laboratory, Department of Computer Science & Technology, Yanbian University, Yanji, Jilin 133002, China
  • Online: 2020-06-01 Published: 2020-06-01

Abstract:

To address the difficulty of segmenting characters in Korean historical documents in which Korean and Chinese scripts are mixed, this paper proposes a character segmentation algorithm for images of such documents. Columns in these documents are often separated by discontinuous ruling lines and may be skewed or touch one another, so the columns are first segmented with a method based on connected-component projection. Characters are then segmented by deleting, merging, and splitting connected components. A multi-step segmentation procedure handles images in which character sizes vary and horizontal and vertical layouts are mixed. Touching characters are separated with an improved drop-fall algorithm. Experimental results show that the proposed algorithm effectively segments characters in Korean historical document images that mix Korean and Chinese scripts, contain characters of different sizes, and have complex layouts. On the experimental dataset, the algorithm achieves a column segmentation accuracy of 97.69% and a character segmentation accuracy of 87.79%.

Key words: ancient book digitization, Korean historical documents, column segmentation, character segmentation
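
The abstract names connected-component projection for column segmentation but gives no details. The following is a minimal sketch of one plausible reading, not the authors' implementation: foreground blobs are labeled, very small blobs are deleted as noise, and the bounding boxes of the remaining blobs are projected onto the horizontal axis so that empty runs in the projection mark column gaps. The function names, the `min_area` threshold, and the toy image are all assumptions made for illustration.

```python
from collections import deque


def connected_components(img):
    """Return (x0, y0, x1, y1, area) for each 8-connected foreground blob."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            x0 = x1 = x
            y0 = y1 = y
            area = 0
            while queue:
                cy, cx = queue.popleft()
                area += 1
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((x0, y0, x1, y1, area))
    return boxes


def split_columns(img, min_area=2):
    """Project component bounding boxes onto the x-axis and cut at empty gaps.

    min_area is a hypothetical noise threshold: smaller blobs are deleted
    before projection so that specks in a gap do not bridge two columns.
    """
    w = len(img[0])
    profile = [0] * w
    for x0, y0, x1, y1, area in connected_components(img):
        if area < min_area:
            continue
        for x in range(x0, x1 + 1):
            profile[x] += y1 - y0 + 1  # weight each blob by its height
    columns, start = [], None
    for x, value in enumerate(profile):
        if value and start is None:
            start = x
        elif not value and start is not None:
            columns.append((start, x - 1))
            start = None
    if start is not None:
        columns.append((start, w - 1))
    return columns


# Toy binary page: two text columns and one isolated noise pixel between them.
rows = [
    "110000110",
    "110000110",
    "000010000",
    "110000110",
    "110000110",
    "000000000",
]
img = [[int(c) for c in row] for row in rows]
print(split_columns(img))              # [(0, 1), (6, 7)]
print(split_columns(img, min_area=1))  # [(0, 1), (4, 4), (6, 7)]
```

Projecting components rather than raw pixels lets noise be discarded per blob before the profile is built, which is one way the deletion step mentioned in the abstract could interact with column detection. Skew correction, broken ruling lines, and touching columns are not handled in this sketch.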