Text Detection Algorithm Based on Multi-Scale Attention Feature Fusion

doi:10.3778/j.issn.1002-8331.2207-0410

Abstract

To address the low detection accuracy for small-scale text and long text in text detection, a scene text detection algorithm based on multi-scale attention feature fusion is proposed. The method takes Mask R-CNN as the baseline model and uses Swin Transformer as the backbone network to extract low-level features. In the feature pyramid network (FPN), multi-scale attention heat maps are fused with the low-level features through lateral connections, so that different levels of the detector focus on targets of specific scales. The relationship between adjacent attention heat maps is used to realize vertical feature sharing within the FPN structure, which avoids inconsistent gradient computation among levels. Experimental results show that the precision, recall, and F-measure of the method reach 88.3%, 83.07%, and 85.61%, respectively, on the ICDAR2015 dataset, and that it performs well compared with existing methods on the curved-text datasets CTW1500 and Total-Text.
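The three ICDAR2015 figures can be cross-checked against each other. The abstract does not state how the F-value is computed, so the minimal sketch below assumes it is the standard harmonic mean of the first two reported figures, treated here as precision and recall. The function name `f_measure` is illustrative and not taken from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same unit, e.g. percent).

    Assumption: the paper's F-value follows this standard definition.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract for ICDAR2015, in percent.
precision, recall = 88.3, 83.07
f = f_measure(precision, recall)
print(round(f, 2))
```

Under this assumption the result agrees with the reported 85.61% to two decimal places.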

Key words: scene text detection, Mask R-CNN, Swin Transformer, attention mechanism, multi-scale feature fusion

SHE Xiangyang, LIU Zhe, DONG Lihong. Text Detection Algorithm Based on Multi-Scale Attention Feature Fusion[J]. Computer Engineering and Applications, 2024, 60(1): 198-206.

厍向阳, 刘哲, 董立红. Scene Text Detection Based on Multi-Scale Attention Feature Fusion[J]. Computer Engineering and Applications, 2024, 60(1): 198-206.
