Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (21): 225-233. DOI: 10.3778/j.issn.1002-8331.2408-0349

• Pattern Recognition and Artificial Intelligence •

Multimodal Cross-View Contrastive Memory-Augmented Network for Self-Supervised Skeleton-Based Action Recognition

BAI Tian, GAO Yuehong, XIE Zhengguang, LI Hongjun   

  1. School of Information Science and Technology, Nantong University, Nantong, Jiangsu 226019, China
  • Online: 2025-11-01  Published: 2025-10-31

Abstract: In recent years, self-supervised representation learning for skeleton-based human action recognition has made significant progress. However, most existing methods rely on only three data modalities (skeletal joints, bones, and motion) and thus fail to fully exploit the rich information contained in skeleton data. Moreover, many studies fall short in extracting deep-level features from skeleton sequences and in performing diverse contrastive learning. To address these issues, the multimodal cross-view contrastive memory-augmented network (MCCMN) is proposed. The MCCMN model consists of three main components. First, in addition to the three traditional modalities, three new data modalities (acceleration, rotation axis, and angular velocity) are introduced to enrich the representation of skeleton information. Second, a graph convolutional network (GCN) serves as the feature encoder, augmented with a nonlinear projection layer that maps high-dimensional features, capturing deep-level patterns and associations within the data and improving model robustness. Finally, a cross-view contrastive memory augmentation mechanism is proposed: it dynamically updates a negative-sample queue to enrich the diversity of contrastive samples and exploits complementary information between different modality views to strengthen feature learning, achieving consistent contrastive learning across multimodal views. Experimental results on benchmark datasets demonstrate that the MCCMN model outperforms existing methods across various metrics, proving its effectiveness and broad application prospects in self-supervised skeleton-based action recognition tasks.
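
The abstract describes two concrete mechanisms: deriving extra skeleton modalities beyond joints, bones, and motion, and a dynamically updated negative-sample queue for cross-view contrastive learning. The PyTorch sketch below is illustrative only, not the authors' code: the tensor layout, the exact derivations of the rotation-axis and angular-velocity modalities, and all function names (derive_modalities, queue_infonce, update_queue) are assumptions, and the queue follows the familiar MoCo-style design that the abstract's description suggests.

```python
# Illustrative sketch (not the authors' code). Assumed tensor layout:
# joints has shape (N, C, T, V) = (batch, 3-D coordinates, frames, joints),
# and parents[v] is the parent of joint v in the skeleton tree.
import torch
import torch.nn.functional as F

def derive_modalities(joints, parents):
    # Bone: spatial difference from each joint to its parent.
    bone = joints - joints[:, :, :, parents]
    # Motion: first temporal difference of joint positions.
    motion = torch.zeros_like(joints)
    motion[:, :, 1:] = joints[:, :, 1:] - joints[:, :, :-1]
    # Acceleration: second temporal difference (difference of motion).
    accel = torch.zeros_like(joints)
    accel[:, :, 1:] = motion[:, :, 1:] - motion[:, :, :-1]
    # Rotation axis (assumed derivation): cross product of each bone's
    # direction in consecutive frames, i.e. the axis it rotates about.
    axis = torch.zeros_like(bone)
    axis[:, :, 1:] = torch.cross(bone[:, :, :-1], bone[:, :, 1:], dim=1)
    # Angular velocity (assumed derivation): per-frame angle between
    # consecutive bone directions, broadcast over the coordinate channel.
    cos = F.cosine_similarity(bone[:, :, :-1], bone[:, :, 1:], dim=1)
    ang = torch.zeros_like(bone)
    ang[:, :, 1:] = torch.acos(cos.clamp(-1 + 1e-6, 1 - 1e-6)).unsqueeze(1)
    return joints, bone, motion, accel, axis, ang

def queue_infonce(q, k, queue, temperature=0.07):
    # MoCo-style InfoNCE loss. q and k are (N, D) projections of two views
    # of the same clips; queue is a (D, K) memory bank of negatives.
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    l_pos = (q * k).sum(dim=1, keepdim=True)        # (N, 1) positive logits
    l_neg = q @ queue                               # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)          # positives sit at index 0

@torch.no_grad()
def update_queue(queue, keys, ptr):
    # Dynamic update of the negative-sample queue: enqueue the newest keys
    # and overwrite (dequeue) the oldest entries, ring-buffer style.
    K, n = queue.size(1), keys.size(0)
    idx = torch.arange(ptr, ptr + n, device=queue.device) % K
    queue[:, idx] = F.normalize(keys, dim=1).t()
    return (ptr + n) % K
```

In MCCMN, each modality view would pass through its own GCN encoder and nonlinear projection head, and a loss of this form would be applied across pairs of views; that cross-view pairing, and the momentum-updated key encoder such queues are usually paired with, are omitted here for brevity.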

Key words: skeleton-based action recognition, self-supervised learning, multimodal data, contrastive learning
