Scene Text Detection Based on Multi-Scale Pooling and Bidirectional Feature Fusion

doi:10.3778/j.issn.1002-8331.2210-0077

Abstract

Abstract: Text in natural scenes appears against complex, varied backgrounds and differs widely in shape and size. To address this problem, a new segmentation-based scene text detection network is proposed. Two modules, multi-scale pooling and bidirectional feature fusion, are built to improve network performance. Based on the characteristics of text instances, the multi-scale pooling module applies spatial pooling with windows of different aspect ratios to capture dependencies among text information at different distances, which guides the network toward more accurate segmentation results. The bidirectional feature fusion module constructs two fusion paths in different directions to make better use of the backbone network's features at different scales and to improve detection of text at different scales. Experimental results demonstrate the effectiveness of the proposed method: on the three public datasets ICDAR2015, MSRA-TD500, and Total-Text, it achieves F-measures of 87.7%, 86.7%, and 85.5%, respectively.

Key words: text detection, image segmentation, multi-scale pooling, bidirectional feature fusion
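
The idea of pooling with windows of different aspect ratios can be illustrated with a minimal pure-Python sketch. The abstract does not state the window sizes, the pooling type, or how the pooled maps are combined, so everything below is an assumption made for illustration: average pooling with odd-sized windows, border clipping to keep the output the same size as the input, and a plain mean to merge the results. The function names `avg_pool_same` and `multi_scale_pool` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Grid = List[List[float]]


def avg_pool_same(feature: Grid, win_h: int, win_w: int) -> Grid:
    """Average-pool a 2D map with a win_h x win_w window (odd sizes assumed).

    The window is centred on each cell and clipped at the borders, so the
    output has the same shape as the input.
    """
    rows, cols = len(feature), len(feature[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            r0, r1 = max(0, r - win_h // 2), min(rows, r + win_h // 2 + 1)
            c0, c1 = max(0, c - win_w // 2), min(cols, c + win_w // 2 + 1)
            total = sum(feature[i][j] for i in range(r0, r1) for j in range(c0, c1))
            out[r][c] = total / ((r1 - r0) * (c1 - c0))
    return out


def multi_scale_pool(feature: Grid, windows: Sequence[Tuple[int, int]]) -> Grid:
    """Pool with several (height, width) windows and average the results.

    Averaging is an assumed merge step; the paper may combine the pooled
    maps differently.
    """
    pooled = [avg_pool_same(feature, h, w) for h, w in windows]
    rows, cols = len(feature), len(feature[0])
    return [
        [sum(p[r][c] for p in pooled) / len(pooled) for c in range(cols)]
        for r in range(rows)
    ]


# A toy 3 x 5 response map containing one horizontal "text line" in the middle row.
text_row: Grid = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]

wide = avg_pool_same(text_row, 1, 5)   # wide window: follows the horizontal line
tall = avg_pool_same(text_row, 3, 1)   # tall window: mixes the line with background
fused = multi_scale_pool(text_row, [(1, 5), (3, 1), (3, 3)])
```

In this toy case the wide 1 x 5 window leaves the horizontal line untouched, while the tall 3 x 1 window dilutes it with the background rows above and below. This suggests why elongated windows aligned with a text instance could capture long-range context along it without pulling in unrelated regions.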

WEI Zheliang, LI Yueyang, LUO Haichi. Scene Text Detection Based on Multi-Scale Pooling and Bidirectional Feature Fusion[J]. Computer Engineering and Applications, 2024, 60(2): 154-161.
