Scene Text Spotting Based on Cross-Modal and Circular Factorized Self-Attention

doi:10.3778/j.issn.1002-8331.2403-0361

Abstract

Current end-to-end scene text spotting methods usually integrate the two subtasks of text detection and text recognition into a unified framework without sufficiently considering the interaction and synergy between them. To address this problem, an end-to-end scene text spotting method based on cross-modal and circular factorized self-attention is proposed. First, a cross-modal module built on the scaled dot-product attention mechanism is designed to strengthen the fusion of visual and semantic information, and thereby the interaction between text detection and recognition. Second, circular factorized self-attention with circular convolution replaces the self-attention in the decoder to better capture the contour features of text instances and improve text detection performance. Finally, extensive experiments on the Total-Text, CTW1500, and ICDAR 2015 datasets show that the proposed method clearly improves on current mainstream methods in the precision, recall, and F-measure of text detection and in the accuracy of end-to-end text spotting. Ablation experiments further demonstrate the effectiveness of the proposed method.

Keywords: scene text spotting, cross-modal, circular convolution, factorized self-attention, feature fusion
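
The abstract names two generic building blocks, scaled dot-product attention and circular convolution, without giving implementation details. The following is a minimal pure-Python sketch of those two operations only, under stated assumptions: the function names `cross_attention` and `circular_conv1d` are hypothetical, the choice of visual features as queries and semantic features as keys and values is an illustrative guess, and none of this should be read as the authors' actual module design.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Assumption for illustration: `queries` come from one modality
    (e.g. visual features, n x d) and `keys`/`values` from another
    (e.g. semantic features, m x d).
    """
    d = len(queries[0])
    out_dim = len(values[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        fused.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        )
    return fused


def circular_conv1d(points, kernel):
    """1-D convolution with wrap-around indexing along a closed contour.

    Implemented as cross-correlation (no kernel flip), as is common in
    deep learning. The first and last contour points are treated as
    neighbours, which is what makes the convolution "circular".
    """
    n, k = len(points), len(kernel)
    half = k // 2
    return [
        sum(kernel[j] * points[(i + j - half) % n] for j in range(k))
        for i in range(n)
    ]
```

For example, `circular_conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0])` sums each point with its two neighbours on the closed contour, so the first output combines the last, first, and second points.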

XU Shikang, LIU Junfeng, ZENG Jun, LIAO Dingding. Scene Text Spotting Based on Cross-Modal and Circular Factorized Self-Attention[J]. Computer Engineering and Applications, 2025, 61(11): 176-184.

徐诗康, 刘俊峰, 曾君, 廖丁丁. 基于跨模态和循环分解自注意力的场景文本识别[J]. 计算机工程与应用, 2025, 61(11): 176-184.
