Dual-Network Fusion 3D Gesture Interaction Algorithm Based on Deep Residual Network

doi:10.3778/j.issn.1002-8331.2207-0022

Abstract

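Abstract (translated from the Chinese): Existing 3D gesture interaction methods face a three-way trade-off among accuracy, cost, and generality, and at the current state of development most of them achieve only two of the three. Traditional recognition methods usually prioritize cost and generality at the expense of accuracy. Data-glove-based 3D gesture interaction delivers accuracy and generality but cannot avoid high cost. Methods developed for a specific domain (for example, sign-language teaching or traffic gesture recognition) achieve accuracy and low cost but apply only within that narrow domain. To balance these three factors, this paper proposes a new 3D gesture pose estimation algorithm that requires, at minimum, only a monocular camera. With recognition accuracy maintained, its peak performance reaches 85~100 FPS. The algorithm is built on a dual-network fusion architecture consisting of a front-end recognition network and a back-end correction network. The front-end network, designed on ResNet50, detects hand keypoints by mapping 2D keypoints to 3D. The back-end network, designed on inverse kinematics, corrects the irregular hand shapes caused by mapping errors, making the recognition results smoother and consistent with real human anatomy. The design of the architecture allows it to draw on a wider range of available hand training data sources, and the added back-end correction module refines hand details; this form of output lets the algorithm be applied more directly to visual design and graphical interactive output. In repeated comparative experiments against traditional algorithms, the architecture improves average recognition accuracy by 1%~2% across a range of datasets while maintaining recognition speed, and the MANO-based modeling results are more refined and realistic.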

Keywords (translated from the Chinese): gesture interaction, residual network, inverse kinematics, dual-network fusion, 3D modeling

Abstract: To address problems in existing 3D gesture interaction methods, such as low accuracy and poor reconstruction fidelity, this paper proposes a new 3D gesture pose estimation method based on monocular camera technology. With recognition accuracy maintained, its peak performance reaches 85~100 FPS. The method is built on a dual-network fusion architecture, whose design makes it possible to use all available hand training data sources. On top of this, a post-correction network module designed on inverse kinematics is added to correct hand details. This form of output can be applied more directly to visual design and graphical interactive output. In repeated comparative experiments, the architecture improves average recognition accuracy by 1%~2% over traditional algorithms while maintaining recognition speed, and the MANO-based modeling results are more refined and realistic.

Key words: gesture interaction, residual network, inverse kinematics, dual network integration, 3D modeling
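
The abstract describes a second-stage network that corrects anatomically implausible hand shapes produced by the 2D-to-3D keypoint mapping. As a rough illustration of that idea only, the sketch below applies one simple kinematic constraint, fixed bone lengths, to a set of predicted 3D keypoints. It is not the paper's inverse-kinematics correction network: the 21-keypoint layout, the `PARENTS` table, and the function `enforce_bone_lengths` are all assumptions introduced here for illustration.

```python
import math

# Assumed 21-keypoint layout (wrist = 0, then four joints per finger).
# The paper's actual joint ordering is not given in the abstract.
PARENTS = [-1, 0, 1, 2, 3, 0, 5, 6, 7, 0, 9, 10, 11, 0, 13, 14, 15, 0, 17, 18, 19]


def enforce_bone_lengths(joints, ref_lengths):
    """Rescale every bone to its reference length, keeping its direction.

    joints: list of 21 (x, y, z) tuples, e.g. from a hypothetical
        first-stage keypoint network.
    ref_lengths: list of 21 floats; ref_lengths[i] is the target length of
        the bone from PARENTS[i] to joint i (entry 0 is unused).
    """
    corrected = [tuple(joints[0])] + [None] * (len(joints) - 1)
    # In this layout every parent has a smaller index than its children,
    # so one forward pass visits each parent before its children.
    for i in range(1, len(joints)):
        p = PARENTS[i]
        direction = [joints[i][k] - joints[p][k] for k in range(3)]
        norm = math.sqrt(sum(d * d for d in direction))
        if norm == 0.0:
            # Degenerate bone: no direction to preserve, so the joint is
            # placed on its (already corrected) parent.
            corrected[i] = corrected[p]
            continue
        scale = ref_lengths[i] / norm
        corrected[i] = tuple(corrected[p][k] + direction[k] * scale for k in range(3))
    return corrected
```

A full inverse-kinematics stage would typically also solve for joint rotations and limit them to plausible ranges; this sketch only shows how a structural prior can be imposed on raw keypoint predictions after the first stage.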

XUE Jiawei, KONG Weiwei, WANG Ze. Dual-Network Fusion 3D Gesture Interaction Algorithm Based on Deep Residual Network[J]. Computer Engineering and Applications, 2023, 59(20): 176-183.
