Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (11): 176-184. DOI: 10.3778/j.issn.1002-8331.2403-0361

• Pattern Recognition and Artificial Intelligence •

Scene Text Spotting Based on Cross-Modal and Circular Factorized Self-Attention

XU Shikang, LIU Junfeng, ZENG Jun, LIAO Dingding   

  1. School of Automation Science and Engineering, South China University of Technology, Guangzhou 510641, China
    2. School of Electric Power Engineering, South China University of Technology, Guangzhou 510641, China
  • Online: 2025-06-01    Published: 2025-05-30

摘要: 针对目前端到端场景文本识别方法通常将文本检测和文本识别两个子任务整合到一个统一的框架中,而没有充分考虑文本检测和识别之间的交互和协同的问题,提出了一种基于跨模态和循环分解自注意力的端到端场景文本识别方法。基于缩放点积注意力机制,设计了一种跨模态模块,旨在增强视觉信息和语义信息的融合,从而增强文本检测和识别之间的交互。采用了带有循环卷积的循环分解自注意力替代解码器中的自注意力,以更好地捕捉文本实例的轮廓特征,从而提高文本检测的性能。在Total-Text、CTW1500和ICDAR 2015数据集上的大量实验表明,该方法相较于当前主流方法在文本检测的准确率、召回率、F值和端到端文本识别的准确率上均有较为明显的提升,并且消融实验也证明了所提方法的有效性。

关键词: 场景文本识别, 跨模态, 循环卷积, 分解自注意力, 特征融合

Abstract: Current end-to-end scene text spotting methods usually integrate the two subtasks of text detection and text recognition into a unified framework without sufficiently considering the interaction and synergy between them. To address this issue, an end-to-end scene text spotting method based on cross-modal and circular factorized self-attention is proposed. Firstly, a cross-modal module built on the scaled dot-product attention mechanism is designed to strengthen the fusion of visual and semantic information and thereby enhance the interaction between text detection and recognition. Then, a circular factorized self-attention with circular convolution replaces the self-attention in the decoder to better capture the contour features of text instances and thus improve detection performance. Finally, extensive experiments on the Total-Text, CTW1500 and ICDAR 2015 datasets show that the proposed method achieves clear improvements over current mainstream methods in the precision, recall and F-measure of text detection and in the accuracy of end-to-end text spotting, and the ablation experiments further demonstrate the effectiveness of the proposed method.
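
The abstract does not give implementation details of the cross-modal module, so the following is only a minimal PyTorch sketch of one plausible form: visual tokens serve as queries and semantic (recognition) tokens as keys and values in scaled dot-product attention. The class name, tensor shapes and the residual design are illustrative assumptions, not the authors' implementation.

import math
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse visual and semantic token features with scaled dot-product attention (illustrative sketch)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # queries from visual features
        self.k_proj = nn.Linear(dim, dim)   # keys from semantic features
        self.v_proj = nn.Linear(dim, dim)   # values from semantic features
        self.out_proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        # visual:   (B, N_v, C) visual tokens, e.g. from the detection branch
        # semantic: (B, N_s, C) semantic tokens, e.g. from the recognition branch
        q = self.q_proj(visual)
        k = self.k_proj(semantic)
        v = self.v_proj(semantic)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
        fused = self.out_proj(attn @ v)
        # Residual connection keeps the original visual cues while injecting semantic context.
        return self.norm(visual + fused)

if __name__ == "__main__":
    fusion = CrossModalFusion(dim=256)
    vis = torch.randn(2, 100, 256)   # assumed 100 visual queries per image
    sem = torch.randn(2, 25, 256)    # assumed 25 character-level semantic tokens
    print(fusion(vis, sem).shape)    # torch.Size([2, 100, 256])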

Key words: scene text spotting, cross-modal, circular convolution, factorized self-attention, feature fusion
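
The circular factorized self-attention with circular convolution can likewise be illustrated with a small sketch under stated assumptions: contour control points of each text instance are treated as a closed sequence (circular padding in a 1D convolution), and self-attention is factorized into an intra-instance pass over points and an inter-instance pass over text instances. The shapes, factorization order and layer names below are assumptions for illustration only, not the paper's exact decoder design.

import torch
import torch.nn as nn

class CircularFactorizedSelfAttention(nn.Module):
    """Circular convolution over contour points followed by factorized self-attention (illustrative sketch)."""

    def __init__(self, dim: int = 256, heads: int = 8, kernel: int = 3):
        super().__init__()
        # Circular padding treats the control points as a closed polygon,
        # so the convolution also links the last point back to the first.
        self.circ_conv = nn.Conv1d(dim, dim, kernel_size=kernel,
                                   padding=kernel // 2, padding_mode="circular")
        self.intra_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, I, P, C) = batch, text instances, contour points per instance, channels
        b, i, p, c = x.shape
        # Circular convolution along the point dimension of each instance.
        h = x.reshape(b * i, p, c).transpose(1, 2)          # (B*I, C, P)
        h = self.circ_conv(h).transpose(1, 2)               # (B*I, P, C)
        # Intra-instance attention: points of one contour attend to each other.
        h = self.norm1(h + self.intra_attn(h, h, h)[0])
        # Inter-instance attention: instances attend to each other at each point index.
        h = h.reshape(b, i, p, c).permute(0, 2, 1, 3).reshape(b * p, i, c)
        h = self.norm2(h + self.inter_attn(h, h, h)[0])
        return h.reshape(b, p, i, c).permute(0, 2, 1, 3)    # back to (B, I, P, C)

if __name__ == "__main__":
    layer = CircularFactorizedSelfAttention(dim=256)
    queries = torch.randn(2, 20, 16, 256)  # assumed 20 instances with 16 contour points each
    print(layer(queries).shape)            # torch.Size([2, 20, 16, 256])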