Natural Scene Text Recognition Based on Encoder-Decoder Framework with Dual Supervision Mechanism

doi:10.3778/j.issn.1002-8331.2009-0459

Abstract

Abstract: Text in complex natural scenes is difficult to recognize, and irregular text in particular remains very challenging. To address this, a dual-supervised network with an attention mechanism is proposed. The design is motivated by the observation that people reading a word in the real world do not usually rectify it in their minds; instead, they adjust their focus and visual range. In the feature extraction stage, a deformable convolutional layer with adjustable geometric structure is combined with a text attention module, forcing the model to focus on the text region without rectifying the position of irregular text. The overall framework has two supervision branches: one comes from context-level modeling, and the other is an additional supervision enhancement branch that aims to handle ambiguous semantic information at the character level. The two supervisions promote each other and yield better performance. The proposed method can recognize text of arbitrary length and does not require any predefined dictionary. Experiments show that, compared with the baseline methods, the proposed method significantly improves recognition accuracy on scene text benchmark datasets.

Key words: scene text recognition, attention mechanism, dual supervision

摘要： 针对复杂的自然场景下文本较难识别的情况，特别是对不规则文本的识别仍很具挑战性，提出了一种具有注意机制的双监督网络。考虑到在现实世界中阅读单词时通常不会在脑海中纠正他，而是调整焦点和视觉范围。在特征提取过程中利用几何结构可调的可变形卷积层结合文本注意模块，强制模型专注于文本区域，无需对不规则的文本进行位置纠正。该文的总体框架有两个分支监督，一个监督分支来自上下文级别建模，另一个监督分支来自一个额外的监督增强分支，旨在处理角色级别的不明确的语义信息。这两个监督可以相互促进，并产生更好的性能。所提出的方法可以识别任意长度的文本，并且不需要任何预定义的词典。实验表明，与对比方法相比，提出的方法在场景文本基准数据集上的识别精度有明显提升。

关键词: 场景文本识别, 注意力机制, 双监督
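
The abstract does not state how the two supervision signals are combined during training. The following is a minimal sketch of one common way to do it, assuming each branch emits per-step character logits, each branch is scored with cross-entropy against the same target string, and the two losses are added with a balancing weight. The function names, the `weight` parameter, and the choice of cross-entropy are all illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_cross_entropy(step_logits, target_ids):
    """Mean negative log-likelihood of the target characters over all steps."""
    if len(step_logits) != len(target_ids):
        raise ValueError("one logit vector is required per target character")
    nll = 0.0
    for logits, target in zip(step_logits, target_ids):
        nll -= math.log(softmax(logits)[target])
    return nll / len(target_ids)


def dual_supervised_loss(context_logits, enhancement_logits, target_ids, weight=1.0):
    """Weighted sum of the two branch losses.

    `weight` is a hypothetical balancing factor; the source does not specify one.
    """
    context_loss = sequence_cross_entropy(context_logits, target_ids)
    enhancement_loss = sequence_cross_entropy(enhancement_logits, target_ids)
    return context_loss + weight * enhancement_loss


# Toy example: a 3-character alphabet and a 2-character target.
target = [0, 1]
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]     # an uninformed branch
confident = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]   # a branch that favors the target
print(dual_supervised_loss(uniform, uniform, target))
print(dual_supervised_loss(confident, confident, target))
```

In this sketch both branches receive gradients from the same label sequence, so an improvement in either branch lowers the total loss. This is one simple reading of the claim that the two supervisions promote each other.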

CHEN Zuozan, XU Bing, DING Xiaojun, GAN Jingzhong. Natural Scene Text Recognition Based on Encoder-Decoder Framework with Dual Supervision Mechanism[J]. Computer Engineering and Applications, 2022, 58(6): 128-133.

陈佐瓒, 徐兵, 丁小军, 甘井中. 基于Encoder-Decoder框架的双监督机制自然场景文本识别[J]. 计算机工程与应用, 2022, 58(6): 128-133.
