Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (12): 196-209. DOI: 10.3778/j.issn.1002-8331.2402-0117

• Graphics and Image Processing •

Community Pedestrian Behavior Detection Algorithm Combining Self-Attention and CNN

吴春龙,张荣芬,刘宇红,欧阳玉旋   

  1. College of Big Data and Information Engineering, Guizhou University, Guiyang 550025
  • Publication date: 2025-06-15 Release date: 2025-06-13

Algorithm for Community Pedestrian Behavior Detection Integrating Self-Attention Mechanisms with CNN

WU Chunlong, ZHANG Rongfen, LIU Yuhong, OUYANG Yuxuan   

  1. College of Big Data and Information Engineering, Guizhou University, Guiyang 550025, China
  • Online: 2025-06-15 Published: 2025-06-13

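Abstract: Pedestrian behavior detection algorithms can effectively address a weakness of traditional community security management, which relies mainly on manual patrols or after-the-fact review of surveillance footage and therefore consumes substantial manpower and resources and responds slowly to emergencies. Existing behavior detection models face two main challenges in real-world deployment: high-accuracy models have large computational cost and parameter counts, which makes them difficult to deploy; low-complexity models are easy to deploy because their computational cost and parameter counts are small, but their accuracy is too low for practical use. Targeting smart digital community management, this work takes TSM (temporal shift module), a video behavior detection algorithm built on a 2D convolutional neural network (2D-CNN), as the core algorithm and improves it, aiming to raise detection accuracy while reducing computational cost and parameter count so that the model is easier to deploy in practice. A new DACGhostBottleneck1, designed from the hybrid self-attention module ACmix and GhostConv, replaces most of the Bottleneck1 blocks in the TSM backbone, reducing computational cost and parameter count while improving the model's ability to process long sequences and understand global information. GhostConv replaces the Conv in some of the Bottleneck1 blocks and in all Bottleneck2 blocks of the TSM backbone, substantially reducing parameters and computation. A TAACTION attention module that fuses temporal, motion-state, and channel information is proposed, effectively improving spatiotemporal modeling capability. CSCConvBlock, designed by combining SCConv and Conv, replaces the Conv in stage 0 of the TSM backbone, improving detection accuracy with almost no increase in computational cost. Finally, the Video Mix-up data augmentation is applied to improve classification performance, generalization, and robustness. On the experimental dataset, the improved algorithm raises Top1 and Accuracy by 5.81 and 6.05 percentage points respectively, and reduces parameter count and computational cost by 48.17% and 51.98% respectively compared with the original TSM model. Overall, the improved algorithm raises detection accuracy while effectively reducing parameters and computation, clearly outperforms the original algorithm, is easier to deploy in practice, and has real application value.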

Key words: self-attention mechanism, spatiotemporal attention mechanism, behavior detection, residual network, smart digital community

Abstract: This paper focuses on intelligent digital community management and optimizes the temporal shift module (TSM) algorithm, a video behavior detection algorithm based on 2D convolutional neural networks (2D-CNN), to raise accuracy while reducing computational cost and parameter count for practical deployment. A novel DACGhostBottleneck1, built from the hybrid self-attention module ACmix and GhostConv, replaces most of the Bottleneck1 blocks in the TSM backbone network, lowering computational cost and parameter count while improving long-sequence processing and global information understanding. Parameters and computation are reduced further by using GhostConv to replace the Conv in some Bottleneck1 blocks and in all Bottleneck2 blocks. The proposed TAACTION attention module, which merges temporal, motion, and channel information, improves spatiotemporal modeling capability. Additionally, the CSCConvBlock, which combines SCConv and Conv, replaces the stage 0 Conv in the TSM backbone network, improving detection accuracy with minimal computational increase. The Video Mix-up data augmentation improves classification performance, generalization, and robustness. On the experimental dataset, the improved algorithm raises Top1 and Accuracy by 5.81 and 6.05 percentage points, respectively, and reduces parameter count and computational cost by 48.17% and 51.98%, respectively, compared with the original TSM model. Overall, the improved algorithm clearly outperforms the original, and its reduced complexity makes it easier to deploy in practice, giving it real application value.
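The temporal shift idea behind the TSM baseline named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `temporal_shift`, the nested-list data layout, and the default `fold_div=8` are assumptions made for illustration, and spatial dimensions are omitted. A fraction of the channels takes its values from the next frame, an equal fraction from the previous frame, and the rest stay in place, so a 2D-CNN can exchange information across time without extra parameters.

```python
def temporal_shift(frames, fold_div=8):
    """Shift part of the channels along time (illustrative sketch).

    frames: list of T frames, each a list of C per-channel values
            (spatial dimensions are omitted for brevity).
    fold_div: assumed hyperparameter; 1/fold_div of the channels is
              shifted in each temporal direction.
    """
    t_len = len(frames)
    channels = len(frames[0])
    fold = channels // fold_div
    # Positions with no source frame keep the zero padding.
    out = [[0.0] * channels for _ in range(t_len)]
    for t in range(t_len):
        for c in range(channels):
            if c < fold:
                src = t + 1      # first group: read from the next frame
            elif c < 2 * fold:
                src = t - 1      # second group: read from the previous frame
            else:
                src = t          # remaining channels: unchanged
            if 0 <= src < t_len:
                out[t][c] = frames[src][c]
    return out
```

With 8 channels and `fold_div=8`, exactly one channel moves in each direction per frame, and the first and last frames receive zero padding where no neighbouring frame exists.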

Key words: self-attention, spatiotemporal attention, behavior recognition, ResNet, smart digital community
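The Video Mix-up augmentation mentioned in the abstract is not specified in detail here. The sketch below shows generic mixup applied to two video clips, under stated assumptions: clips are nested lists of numbers, labels are one-hot lists, and the mixing weight is drawn from a Beta(alpha, alpha) distribution. The names `mix_values` and `video_mixup` and the default `alpha=0.2` are hypothetical, not taken from the paper.

```python
import random


def mix_values(a, b, lam):
    """Blend two equally shaped nested lists element by element."""
    if isinstance(a, list):
        return [mix_values(x, y, lam) for x, y in zip(a, b)]
    return lam * a + (1.0 - lam) * b


def video_mixup(clip_a, label_a, clip_b, label_b, alpha=0.2, rng=random):
    """Return a mixed clip, a mixed soft label, and the mixing weight.

    alpha is an assumed hyperparameter of the Beta distribution;
    rng may be the random module or a random.Random instance.
    """
    lam = rng.betavariate(alpha, alpha)
    mixed_clip = mix_values(clip_a, clip_b, lam)
    mixed_label = mix_values(label_a, label_b, lam)
    return mixed_clip, mixed_label, lam
```

Because the clip and the label are blended with the same weight, the soft label still sums to 1, which is what lets the mixed sample be trained with a standard cross-entropy loss over soft targets.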