Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (22): 282-291. DOI: 10.3778/j.issn.1002-8331.2307-0335

• Graphics and Image Processing •

Lightweight Full-Flow Bidirectional Fusion Network for 6D Pose Estimation

LIN Haotian, LI Yongchang, JIANG Jing, QIN Guangjun   

  1. Smart City College, Beijing Union University, Beijing 100101, China
  • Online: 2024-11-15  Published: 2024-11-14

Abstract: Six degrees of freedom (6D) pose estimation is a key step in applications such as robot grasping and manipulation, augmented reality, and autonomous driving. Conventional 6D pose estimation methods concentrate on designing complex networks to improve accuracy, while overlooking the practical deployment difficulties caused by high model complexity and large parameter counts. Taking FFB6D as the baseline, this paper designs a lightweight full-flow bidirectional fusion network (LFFB6D), a lightweight RGBD-based 6D pose estimation method. The method consists of two parallel encoder-decoder networks: a convolutional neural network (CNN) and a point cloud network (PCN). In the CNN branch, FasterNet is introduced to replace 3×3 convolutions: the CNN encoder is replaced and an upsampling module, FUPB (faster upsample block), is proposed to reduce the number of network parameters. In the PCN branch, PoolFormer is introduced to process and aggregate point cloud features, and a new pooling module, PFPB (PoolFormer pooling block), is proposed to improve network performance. Experiments show that LFFB6D has 46% fewer parameters than FFB6D. Using only 1/13 of the LineMOD training set and 1/9 of the YCB-Video training set, LFFB6D surpasses PoseCNN, DenseFusion, and other methods in 6D pose estimation and achieves results comparable to PVN3D and FFB6D.
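
The abstract names two external building blocks: FasterNet-style partial convolution in the CNN branch and PoolFormer-style pooling in the PCN branch. Below is a minimal sketch, assuming PyTorch, of what these two units look like in isolation. The module names, channel ratio, and pool size are illustrative assumptions; this is not the authors' LFFB6D code, and the pooling mixer is shown on a 2D feature map rather than on point-cloud features.

```python
# Illustrative sketch (not the authors' code) of the two building blocks the
# abstract refers to, under assumed hyperparameters:
#   - a FasterNet-style partial convolution (PConv) block, the kind of unit
#     that could replace a plain 3x3 convolution in the CNN branch;
#   - a PoolFormer-style pooling token mixer, the kind of unit a PFPB-like
#     module could use to aggregate features.
import torch
import torch.nn as nn


class PartialConvBlock(nn.Module):
    """FasterNet-style block: 3x3 conv on a channel subset, then 1x1 convs."""

    def __init__(self, channels: int, partial_ratio: float = 0.25):
        super().__init__()
        self.conv_channels = int(channels * partial_ratio)  # only these get the 3x3 conv
        self.pconv = nn.Conv2d(self.conv_channels, self.conv_channels, 3, padding=1, bias=False)
        self.pwconv = nn.Sequential(  # pointwise mixing over all channels
            nn.Conv2d(channels, 2 * channels, 1, bias=False),
            nn.BatchNorm2d(2 * channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels, channels, 1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        part, rest = torch.split(x, [self.conv_channels, x.size(1) - self.conv_channels], dim=1)
        x = torch.cat([self.pconv(part), rest], dim=1)   # spatial mixing on a subset only
        return x + self.pwconv(x)                        # residual pointwise mixing


class PoolingMixer(nn.Module):
    """PoolFormer-style token mixer: average pooling minus identity."""

    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2, count_include_pad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(x) - x  # subtracting the input keeps only the pooled context


if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)        # dummy feature map
    print(PartialConvBlock(64)(feat).shape)  # torch.Size([2, 64, 32, 32])
    print(PoolingMixer()(feat).shape)        # torch.Size([2, 64, 32, 32])
```

The parameter saving in the first block comes from restricting the 3×3 kernel to a fraction of the channels, which is the general idea behind the FasterNet replacement and FUPB that the abstract credits for LFFB6D's 46% parameter reduction relative to FFB6D; the second block replaces learned attention-style mixing with parameter-free pooling, the idea behind PFPB.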

Key words: RGBD, pose estimation, lightweight, FasterNet, PoolFormer
