Application of Scene Text Recognition Technology Based on Deep Learning: A Survey

doi:10.3778/j.issn.1002-8331.2106-0411

Abstract

With the development of deep learning in computer vision, scene text detection and text recognition have made breakthrough progress. Because of extreme lighting, occlusion, blur, and multi-oriented, multi-scale text in natural scenes, unconstrained scene text detection and recognition still face great challenges. This paper surveys scene text detection and text recognition from the perspective of deep learning. It concludes that, in text detection, combining the strengths of segmentation-based and regression-based methods can address the low recall of small text regions while adapting to multi-scale text; in text recognition, combining the CTC mechanism with the Attention mechanism lets the two supervise each other, which improves recognition performance and reduces the error rate on long text.

Key words: deep learning, computer vision, natural scene, text detection, text recognition

LIU Yanju, YI Xinhai, LI Yange, ZHANG Huiyu, LIU Yanzhong. Application of Scene Text Recognition Technology Based on Deep Learning: A Survey[J]. Computer Engineering and Applications, 2022, 58(4): 52-63.
