Computer Engineering and Applications ›› 2013, Vol. 49 ›› Issue (2): 29-33.

Research on segmentation of historical Chinese books

NI Enzhi1, JIANG Minjun2, ZHOU Changle1   

  1. Mind, Art and Computation Lab, School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005, China
  2. School of Computer Science and Information Engineering, Shanghai Institute of Technology, Shanghai 201418, China
  • Online:2013-01-15 Published:2013-01-16

古代汉字文献切分研究

倪恩志1,蒋旻隽2,周昌乐1   

  1. 厦门大学 信息科学与技术学院,艺术认知与计算实验室,福建 厦门 361005
  2. 上海应用技术学院 计算机科学与信息工程学院,上海 201418

Abstract: This paper proposes text line segmentation and character segmentation methods tailored to the characteristics of historical Chinese documents. The line segmentation method analyzes the stroke projection of the document directly and applies a recursive segmentation algorithm based on layered projection filtering and variable-length gap thresholds. The algorithm extracts text lines accurately even when the spacing between lines is small, when lines touch the ruling lines, or when the document is somewhat skewed, and it performs particularly well on short text lines. The character segmentation method has two steps: a rough segmentation determines the approximate cut positions, and a fine segmentation based on connected-component analysis and the detection of adhesion points then refines them. This algorithm can extract complete individual characters even from lines with many touching and overlapping characters. Experimental results show that the proposed methods perform well on historical Chinese documents.

Key words: document image processing, Chinese character segmentation, digitization of ancient books
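
The abstract describes the line segmentation only in outline, so the following Python sketch is an illustration of the general idea and not the authors' implementation. It assumes a binarized page in which ink pixels are 1, and it assumes that text lines run vertically (the Chinese abstract calls them columns), so strokes are projected onto the horizontal axis. The function names, the parameters `max_width`, `min_gap`, `step` and `max_threshold`, and the rule for raising the threshold and shrinking the gap on each recursion are all hypothetical choices made for this example.

```python
def column_projection(image):
    """Count ink pixels (value 1) in each pixel column of a binary image."""
    return [sum(col) for col in zip(*image)]


def split_on_gaps(profile, start, end, threshold, min_gap):
    """Split [start, end) into runs of ink separated by wide enough gaps.

    A position counts as ink only if its projection exceeds `threshold`;
    a gap must span at least `min_gap` non-ink positions to cause a split.
    """
    segments = []
    seg_start = None  # start of the current run
    last_ink = None   # last position whose projection exceeded the threshold
    for x in range(start, end):
        if profile[x] > threshold:
            if seg_start is None:
                seg_start = x
            elif x - last_ink - 1 >= min_gap:
                segments.append((seg_start, last_ink + 1))
                seg_start = x
            last_ink = x
    if seg_start is not None:
        segments.append((seg_start, last_ink + 1))
    return segments


def segment_lines(profile, start, end, max_width,
                  threshold=0, min_gap=2, step=1, max_threshold=10):
    """Recursively split a projection profile into text line intervals.

    Segments wider than `max_width` are assumed to contain touching lines
    and are split again with a higher projection threshold and a smaller
    gap requirement (an assumed schedule, not taken from the paper).
    """
    result = []
    for s, e in split_on_gaps(profile, start, end, threshold, min_gap):
        if e - s > max_width and threshold + step <= max_threshold:
            result.extend(segment_lines(profile, s, e, max_width,
                                        threshold + step,
                                        max(1, min_gap - 1),
                                        step, max_threshold))
        else:
            result.append((s, e))
    return result
```

Calling `segment_lines(column_projection(image), 0, width, max_width)` on a binarized page returns half-open `(start, end)` intervals along the horizontal axis. Raising the threshold lets the recursion cut through thin bridges of ink, such as a stroke that touches a ruling line, which a single global threshold would miss.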

摘要: 针对古代汉字文档的特点,提出了适合于古文档的列切分方法和字切分方法。提出的列切分方法直接对文档的笔画投影进行分析,采用一种基于分层投影过滤和变长间隙阈值的递归切分算法。该算法在列间隔较小、列与格线存在粘连、文档具有一定程度的倾斜的情况下,也能准确地抽取出列,尤其对短列的切分达到了较好的效果。提出的字切分方法分为两步,进行粗切分确定大致的切分位置,采用基于连通域分析与粘连点判断的方法做进一步的细切分。该算法对具有较多粘连和重叠汉字的列,也能较好地切分出完整的单字。实验结果表明,提出的方法用于古代汉字文档切分能够获得较好的效果。

关键词: 文档图像处理, 文档切分, 古籍数字化