计算机工程与应用 (Computer Engineering and Applications), 2022, Vol. 58, Issue (6): 128-133. DOI: 10.3778/j.issn.1002-8331.2009-0459

• 模式识别与人工智能 (Pattern Recognition and Artificial Intelligence) •

基于Encoder-Decoder框架的双监督机制自然场景文本识别

陈佐瓒,徐兵,丁小军,甘井中   

  1. 玉林师范学院 计算机科学与工程学院,广西 玉林 537000
  2. 南京师范大学 地理科学学院,南京 210023
  3. 中南大学 计算机学院,长沙 410083
  • 出版日期:2022-03-15 发布日期:2022-03-15

Natural Scene Text Recognition Based on Encoder-Decoder Framework with Dual Supervision Mechanism

CHEN Zuozan, XU Bing, DING Xiaojun, GAN Jingzhong   

  1. School of Computer Science and Engineering, Yulin Normal University, Yulin, Guangxi 537000, China
  2. School of Geography, Nanjing Normal University, Nanjing 210023, China
  3. School of Computer Science and Engineering, Central South University, Changsha 410083, China
  • Online: 2022-03-15 Published: 2022-03-15

摘要: 针对复杂的自然场景下文本较难识别的情况,特别是对不规则文本的识别仍很具挑战性,提出了一种具有注意机制的双监督网络。考虑到在现实世界中阅读单词时通常不会在脑海中纠正他,而是调整焦点和视觉范围。在特征提取过程中利用几何结构可调的可变形卷积层结合文本注意模块,强制模型专注于文本区域,无需对不规则的文本进行位置纠正。该文的总体框架有两个分支监督,一个监督分支来自上下文级别建模,另一个监督分支来自一个额外的监督增强分支,旨在处理角色级别的不明确的语义信息。这两个监督可以相互促进,并产生更好的性能。所提出的方法可以识别任意长度的文本,并且不需要任何预定义的词典。实验表明,与对比方法相比,提出的方法在场景文本基准数据集上的识别精度有明显提升。

关键词: 场景文本识别, 注意力机制, 双监督

Abstract: Text in complex natural scenes is difficult to recognize, and irregular text in particular remains very challenging. To address this, a dual-supervised network with an attention mechanism is proposed. The design is motivated by the observation that people reading a word in the real world do not usually rectify it in their minds; instead, they adjust their focus and field of view. During feature extraction, deformable convolutional layers, whose geometric structure is adjustable, are combined with a text attention module to force the model to focus on text regions, so the position of irregular text does not need to be corrected. The overall framework is supervised by two branches: one comes from context-level modeling, and the other is an additional supervision enhancement branch that aims to handle ambiguous semantic information at the character level. The two supervision signals reinforce each other and yield better performance. The proposed method can recognize text of arbitrary length and does not require any predefined lexicon. Experiments show that the proposed method clearly improves recognition accuracy over the compared methods on scene text benchmark datasets.

Key words: scene text recognition, attention mechanism, dual supervision
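
Illustrative sketch (not the authors' implementation): the abstract describes two supervision branches, a context-level branch and a character-level enhancement branch, but does not give the loss function. The snippet below assumes, purely for illustration, that each branch outputs per-character logits for the same target sequence and that the two cross-entropy losses are added with a weight. The names dual_supervised_loss and aux_weight are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(step_logits, targets):
    """Mean negative log-likelihood over the characters of one word.

    step_logits: one list of class scores per character position.
    targets: the ground-truth class index for each position.
    """
    losses = [-math.log(softmax(logits)[t])
              for logits, t in zip(step_logits, targets)]
    return sum(losses) / len(losses)


def dual_supervised_loss(context_logits, char_logits, targets, aux_weight=1.0):
    """Combine the two branch losses.

    Assumption (not stated in the abstract): both branches are trained
    against the same character labels, and the losses are summed with a
    hypothetical weighting factor aux_weight.
    """
    context_loss = cross_entropy(context_logits, targets)
    char_loss = cross_entropy(char_logits, targets)
    return context_loss + aux_weight * char_loss


if __name__ == "__main__":
    # Two character positions, four classes, uninformative predictions.
    uniform = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
    print(dual_supervised_loss(uniform, uniform, targets=[1, 3], aux_weight=0.5))
```

Under this assumed formulation, gradients from both terms flow into the shared feature extractor, which is one plausible way for the two supervision signals to reinforce each other as the abstract claims.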