Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (7): 141-146. DOI: 10.3778/j.issn.1002-8331.2211-0295

• Pattern Recognition and Artificial Intelligence •


Applying Attention Transformer Module to 3D Lip Sequence Identification

PIAN Xinyang, WANG Yu, ZHANG Jie   

1. School of Artificial Intelligence, Beijing Technology and Business University, Beijing 100048, China
    2.School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China
  • Online:2024-04-01 Published:2024-04-01


Abstract: Lip behavior is an emerging biometric recognition technology, and three-dimensional (3D) lip point cloud sequences have become an important biometric feature for individual identification because they contain the real spatial structure and motion information of the lips. However, the unordered and unstructured nature of 3D point clouds makes extracting spatio-temporal features very difficult. To this end, a deep learning network model based on a point-feature Transformer is proposed for 3D lip sequence identification. The network uses an improved four-layer PointNet++ as its backbone to extract features hierarchically. To learn more spatio-temporal features containing identity information, a dynamic lip-feature attention Transformer module is designed and attached after each layer of the PointNet++ network; it learns the correlations among different feature maps and effectively captures contextual information across different frames of a video sequence. Compared with Transformers built on other attention mechanisms, the proposed Transformer module has fewer parameters. Experiments on the S3DFM-FP and S3DFM-VP datasets show that the proposed network model is effective for the identification task on 3D lip point cloud sequences, and it performs well even on the pose-unconstrained S3DFM-VP dataset.

Key words: speaker recognition, Transformer, PointNet++, three-dimensional lip point cloud
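The abstract does not include code, and the paper's actual module design is not reproduced here. As a rough, hypothetical sketch of the kind of inter-frame attention the abstract describes (capturing contextual information across frames of a video sequence), the following NumPy snippet applies scaled dot-product self-attention over per-frame pooled feature vectors; all names, shapes, and projection matrices are illustrative assumptions, not the authors' architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def frame_self_attention(frames, w_q, w_k, w_v):
    """Scaled dot-product self-attention across video frames.

    frames: (T, d) array, one pooled feature vector per frame
            (e.g., features pooled from a PointNet++-style backbone).
    w_q, w_k, w_v: (d, d) hypothetical projection matrices.
    Returns a (T, d) array where each frame's feature becomes a
    context-weighted mixture of all frames' features.
    """
    q, k, v = frames @ w_q, frames @ w_k, frames @ w_v
    scores = q @ k.T / np.sqrt(frames.shape[1])  # (T, T) frame-to-frame affinities
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
T, d = 8, 16  # 8 frames, 16-dim pooled features (illustrative sizes)
frames = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out = frame_self_attention(frames, w_q, w_k, w_v)
print(out.shape)  # (8, 16)
```

A parameter-reduced variant (as the abstract claims fewer parameters than other attention-based Transformers) could, for instance, share one projection matrix across queries, keys, and values; the concrete reduction used in the paper is not specified in the abstract.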