Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (7): 141-146. DOI: 10.3778/j.issn.1002-8331.2211-0295

• Pattern Recognition and Artificial Intelligence •


Applying Attention Transformer Module to 3D Lip Sequence Identification

PIAN Xinyang, WANG Yu, ZHANG Jie   

1. School of Artificial Intelligence, Beijing Technology and Business University, Beijing 100048, China
    2.School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China
  • Online:2024-04-01 Published:2024-04-01


Abstract: Lip behavior is an emerging biometric recognition technology, and three-dimensional (3D) lip point cloud sequences have become an important biometric feature for individual identification because they contain the real spatial structure and motion information of the lips. However, the unordered and unstructured nature of 3D point clouds makes extracting spatio-temporal features very difficult. To this end, a deep learning network model based on a point-feature Transformer is proposed for 3D lip sequence identification. The network uses an improved four-layer PointNet++ as its backbone to extract features hierarchically. To learn more spatio-temporal features containing identity information, a dynamic lip-feature attention Transformer module is designed and attached after each layer of the PointNet++ network; it learns the correlations among different feature maps and effectively captures contextual information across different frames of a video sequence. Compared with Transformers built on other attention mechanisms, the proposed Transformer module has fewer parameters. Experiments on the S3DFM-FP and S3DFM-VP datasets show that the proposed network model is effective for the identification task on 3D lip point cloud sequences, and it performs well even on the pose-unconstrained S3DFM-VP dataset.

Key words: speaker recognition, Transformer, PointNet++, three-dimensional lip point cloud
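The abstract does not include code, and the paper's actual module design is not reproduced here. As a rough, hypothetical sketch of the kind of inter-frame attention the abstract describes (capturing contextual information across frames of a video sequence), the following NumPy snippet applies scaled dot-product self-attention over per-frame pooled feature vectors; all names, shapes, and projection matrices are illustrative assumptions, not the authors' architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def frame_self_attention(frames, w_q, w_k, w_v):
    """Scaled dot-product self-attention across video frames.

    frames: (T, d) array, one pooled feature vector per frame
            (e.g., features pooled from a PointNet++-style backbone).
    w_q, w_k, w_v: (d, d) hypothetical projection matrices.
    Returns a (T, d) array where each frame's feature becomes a
    context-weighted mixture of all frames' features.
    """
    q, k, v = frames @ w_q, frames @ w_k, frames @ w_v
    scores = q @ k.T / np.sqrt(frames.shape[1])  # (T, T) frame-to-frame affinities
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
T, d = 8, 16  # 8 frames, 16-dim pooled features (illustrative sizes)
frames = rng.standard_normal((T, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out = frame_self_attention(frames, w_q, w_k, w_v)
print(out.shape)  # (8, 16)
```

A parameter-reduced variant (as the abstract claims fewer parameters than other attention-based Transformers) could, for instance, share one projection matrix across queries, keys, and values; the concrete reduction used in the paper is not specified in the abstract.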