Asymmetric Iterative Refinement Prediction Network for Scene Text Detection

doi:10.3778/j.issn.1002-8331.2310-0243

Abstract

Scene text detection is a fundamental task in image processing with a wide range of applications. DBNet, a representative algorithm in this field, uses an overly simple post-processing step to reconstruct text instances; it tends to produce false detections on text whose aspect ratio varies significantly and to miss small text. To address these problems, this paper proposes AIRPNet, an asymmetric iterative refinement prediction network for scene text detection. The model builds on a ResNet50 feature extraction network, replacing regular convolutions with deformable convolutions to adapt to irregular text features and adjusting the number of stacked blocks so that each layer carries more appropriate features. The recursive idea of the recursive feature pyramid (RFP) is adopted to fuse multi-layer features more fully, an asymmetric iterative refinement prediction module is designed to construct more accurate probability maps, and text instance boundaries are reconstructed by differentiable binarization post-processing. Several structures of the asymmetric iterative refinement prediction module are designed and compared. To evaluate its effectiveness, the proposed model is compared with state-of-the-art mainstream models on three datasets: it achieves F-measures of 88.7%, 88.4%, and 84.9% on ICDAR2015, MSRA-TD500, and CTW1500, respectively, matching or approaching SOTA performance. The experimental results show that the proposed model captures relatively accurate probability maps and thereby constructs relatively complete text bounding boxes.

Key words: text detection, recursive pyramid, asymmetric convolution, iterative refinement prediction, differentiable binarization
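
The differentiable binarization step named in the abstract can be illustrated with a minimal sketch. It assumes the approximate step function from the DBNet formulation, B = 1 / (1 + exp(-k(P - T))), where P is the probability map, T is the threshold map, and k is an amplification factor (50 in DBNet). The function name `approximate_binary_map` and the map values below are hypothetical and are not taken from the AIRPNet implementation.

```python
import math


def approximate_binary_map(prob_map, thresh_map, k=50.0):
    """Element-wise B = 1 / (1 + exp(-k * (P - T))).

    prob_map and thresh_map are equally sized 2-D lists of floats in [0, 1].
    k = 50 follows the DBNet setting; it is an assumption for AIRPNet.
    """
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)
    ]


# Hypothetical 2x3 probability and threshold maps, for illustration only.
prob_map = [[0.9, 0.5, 0.1], [0.8, 0.3, 0.2]]
thresh_map = [[0.3, 0.3, 0.3], [0.3, 0.3, 0.3]]

binary_map = approximate_binary_map(prob_map, thresh_map)
```

Pixels whose probability lies well above the threshold are pushed toward 1 and those well below toward 0, while the function stays differentiable, so the threshold map can be learned jointly with the probability map during training.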

LIAN Zhe, YIN Yanjun, MI Zeng, ZHI Min, XU Qiaozhi. Asymmetric Iterative Refinement Prediction Network for Scene Text Detection[J]. Computer Engineering and Applications, 2025, 61(5): 250-260.
