Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (18): 175-186. DOI: 10.3778/j.issn.1002-8331.2406-0242

• Pattern Recognition and Artificial Intelligence •

Vision-Tactile Fusion Method for Object Recognition Combining Spatio-Temporal Attention

LIU Jia, LI Wenlong, CHEN Dapeng, ZHANG Song, HUANG Xiaorong   

  1.Tianchang Research Institute, Nanjing University of Information Science & Technology, Chuzhou, Anhui 239300, China
    2.School of Automation, Nanjing University of Information Science & Technology, Nanjing 210044, China
    3.Jiangsu Province Engineering Research Center of Intelligent Meteorological Exploration Robot, Nanjing 210044, China
    4.Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology, Nanjing 210044, China
  • Online: 2025-09-15  Published: 2025-09-15

Abstract: To address the insufficient handling of spatio-temporal information and cross-modal heterogeneous information when multi-frame continuous visual and tactile data are used in intelligent robotics, a vision-tactile fusion method for object recognition combining spatio-temporal attention is proposed. The method first uses Swin Transformer modules to extract features from the visual and tactile images separately, reducing the heterogeneity between the two modalities. A spatio-temporal Transformer module based on an attention-bottleneck mechanism then enables spatio-temporal and cross-modal interaction between the visual and tactile features. A multi-head self-attention fusion module subsequently aggregates the information in these features adaptively, improving object recognition accuracy, and a fully connected layer outputs the recognition result. On the public Touch and Go dataset, the model achieves an accuracy of 98.38% and an F1 score of 96.83%, which are 0.90 and 0.63 percentage points higher, respectively, than those of the best-performing comparison model. Ablation experiments further validate the effectiveness of each proposed module, showing that the approach offers a robust way to handle spatio-temporal and cross-modal information for object recognition in intelligent robotics.
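
To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: all class names, dimensions, and the simple linear per-frame encoder (a stand-in for the paper's Swin Transformer backbones) are illustrative assumptions, and the bottleneck layer only follows the general attention-bottleneck idea of letting a few shared tokens mediate the exchange between the visual and tactile token streams.

    # Hypothetical sketch of the described vision-tactile fusion pipeline.
    import torch
    import torch.nn as nn

    class BottleneckFusionLayer(nn.Module):
        """One spatio-temporal layer: each modality attends over its own frame
        tokens plus a small set of shared bottleneck tokens; the bottlenecks
        are the only channel through which the modalities exchange information."""

        def __init__(self, dim: int, heads: int):
            super().__init__()
            self.vis_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.tac_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

        def forward(self, vis, tac, btl):
            n = btl.shape[1]
            v = self.vis_layer(torch.cat([vis, btl], dim=1))  # vision + bottlenecks
            t = self.tac_layer(torch.cat([tac, btl], dim=1))  # touch  + bottlenecks
            vis, btl_v = v[:, :-n], v[:, -n:]
            tac, btl_t = t[:, :-n], t[:, -n:]
            return vis, tac, 0.5 * (btl_v + btl_t)  # merge the two bottleneck updates

    class VisTacFusionNet(nn.Module):
        def __init__(self, dim=256, heads=4, layers=2, bottleneck=4, classes=20):
            super().__init__()
            # Stand-in per-frame encoders; the paper uses Swin Transformer modules.
            self.vis_enc = nn.Sequential(nn.Flatten(1), nn.LazyLinear(dim))
            self.tac_enc = nn.Sequential(nn.Flatten(1), nn.LazyLinear(dim))
            self.btl = nn.Parameter(torch.randn(1, bottleneck, dim))
            self.fusion = nn.ModuleList(
                BottleneckFusionLayer(dim, heads) for _ in range(layers))
            # Multi-head self-attention fusion over the concatenated token streams.
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.head = nn.Linear(dim, classes)  # fully connected classifier

        def forward(self, vis_frames, tac_frames):
            # vis_frames, tac_frames: (B, T, C, H, W) clips of visual / tactile images.
            B, T = vis_frames.shape[:2]
            vis = self.vis_enc(vis_frames.flatten(0, 1)).view(B, T, -1)  # (B, T, dim)
            tac = self.tac_enc(tac_frames.flatten(0, 1)).view(B, T, -1)
            btl = self.btl.expand(B, -1, -1)
            for layer in self.fusion:  # spatio-temporal + cross-modal interaction
                vis, tac, btl = layer(vis, tac, btl)
            tokens = torch.cat([vis, tac], dim=1)
            fused, _ = self.attn(tokens, tokens, tokens)  # adaptive aggregation
            return self.head(fused.mean(dim=1))

    # Usage with random stand-in clips: 2 samples, 5 frames, 3x64x64 images.
    logits = VisTacFusionNet()(torch.randn(2, 5, 3, 64, 64),
                               torch.randn(2, 5, 3, 64, 64))

One design point worth noting: routing all cross-modal exchange through a handful of bottleneck tokens keeps the attention cost low compared with full joint attention over both token streams, while still letting temporal context from each modality influence the other.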

Key words: multimodal fusion, object recognition, vision-tactile fusion, Transformer, self-attention, spatio-temporal information
