Research on Multi-Target Animal Pose Estimation Based on Improved High Resolution Network

doi:10.3778/j.issn.1002-8331.2303-0288

Abstract

In animal pose estimation, the occlusions that arise in multi-target scenes degrade the detection of animal keypoints. To address this problem, PAENet, a multi-target animal pose estimation network based on an improved high-resolution network, is proposed. First, the bottleneck module of the high-resolution network is redesigned using the hybrid convolution ACmix, which integrates the self-attention mechanism, to strengthen the extraction of large-scale features. Second, a PSAsblock basic module that connects a channel attention mechanism and a spatial attention mechanism in series is proposed to extract the multi-scale features of animal pose efficiently. Finally, the feature fusion part at the network output is redesigned to make full use of the feature information in the low-resolution branches, and a deconvolution module is added to further improve the accuracy of the network's heatmap regression. Experiments are carried out on AP10K, a recently released large-scale benchmark dataset for mammal pose estimation. The results show that, compared with the high-resolution network currently used for animal pose estimation, PAENet improves mean average precision (mAP) by 2.4 percentage points and average precision on medium-sized objects (APM) by 3.6 percentage points. PAENet thus effectively enhances the network's ability to extract the features of occluded keypoints in multi-target animal pose estimation.

Key words: multi-target animal pose estimation, high-resolution network, attention mechanism, multi-scale features
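
The abstract refers to heatmap regression without describing how a keypoint is read off a heatmap. The following is a minimal sketch, not the authors' implementation. It assumes the common convention that each keypoint has its own heatmap and that the predicted location is the highest-scoring cell, scaled back to input-image pixels. The function name `decode_keypoint`, the `stride` parameter, and the toy heatmap are illustrative and do not come from the paper.

```python
from typing import List, Tuple


def decode_keypoint(heatmap: List[List[float]], stride: int = 4) -> Tuple[int, int, float]:
    """Return (x, y, score) of the heatmap peak, with x and y in input-image pixels.

    `stride` is the assumed ratio between input resolution and heatmap resolution.
    """
    best_score = float("-inf")
    best_row, best_col = 0, 0
    # Scan every cell and remember the location of the highest response.
    for row_idx, row in enumerate(heatmap):
        for col_idx, score in enumerate(row):
            if score > best_score:
                best_score = score
                best_row, best_col = row_idx, col_idx
    # Columns map to x and rows map to y; scale back to input pixels.
    return best_col * stride, best_row * stride, best_score


# Toy 3x3 heatmap for a single keypoint (hypothetical values).
heatmap = [
    [0.0, 0.1, 0.0],
    [0.1, 0.9, 0.2],
    [0.0, 0.2, 0.1],
]
x, y, score = decode_keypoint(heatmap, stride=4)
print(x, y, score)
```

Under this convention a smaller stride (a higher-resolution heatmap) leaves less quantization error in the decoded coordinates, which is one plausible reason an upsampling stage such as deconvolution can help heatmap regression.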

XU Guidong, XU Yang, DENG Hui, MO Han. Research on Multi-Target Animal Pose Estimation Based on Improved High Resolution Network[J]. Computer Engineering and Applications, 2023, 59(22): 182-192.
