DPENet: Lightweight Document Pose Estimation Network

doi:10.3778/j.issn.1002-8331.2104-0312

Abstract

Existing deep learning models for correcting perspective-skewed document images suffer from large parameter counts, slow inference, and poor spatial generalization. This paper approaches the problem as pose estimation and proposes a lightweight document pose estimation network (DPENet) to address these weaknesses. The model treats a single document in a document image as a pose estimation object and the document's four corner points as its four pose estimation points. It predicts the coordinates of these points with a DSNT (differentiable spatial to numerical transform) module, which combines the advantages of fully connected regression and heatmap regression, to localize the corner points with high precision; a perspective transformation then rectifies the deformed document image. DPENet uses MobileNet V2 as its backbone network, which keeps the model size to only 10.6 MB. In comparison experiments against three mainstream networks on SmartDoc-QA (using only 148 document images), DPENet achieves a better correction success rate (96.6%) and a lower mean displacement error (MDE, 1.28 pixels) than the other three networks, and its average correction speed is also good. DPENet thus achieves a higher correction success rate and correction accuracy on deformed documents while remaining lightweight and fast.

Key words: pose estimation, deep learning, document image rectification, lightweight network, MobileNet V2
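
The two steps named in the abstract, reading a corner coordinate out of a heatmap with DSNT and rectifying the document from its four corners by a perspective transformation, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the softmax normalization, the function names (`dsnt`, `to_pixels`, `homography`, `apply_homography`), and the direct linear solve for the transform are all assumptions made for illustration.

```python
import numpy as np

def dsnt(logits):
    """Map one unnormalized heatmap of shape (H, W) to (x, y) in (-1, 1).

    Assumption: the heatmap is normalized with a softmax over all pixels.
    """
    h, w = logits.shape
    z = np.exp(logits - logits.max())      # subtract the max for numerical stability
    prob = z / z.sum()                     # probability map that sums to 1
    # Pixel-centre coordinates, normalized to the open interval (-1, 1).
    xs = (2.0 * np.arange(w) + 1.0) / w - 1.0
    ys = (2.0 * np.arange(h) + 1.0) / h - 1.0
    # The predicted coordinate is the expected position under the probability map.
    x = float((prob.sum(axis=0) * xs).sum())
    y = float((prob.sum(axis=1) * ys).sum())
    return x, y

def to_pixels(x, y, width, height):
    """Convert normalized DSNT coordinates back to pixel indices."""
    return (x + 1.0) * width / 2.0 - 0.5, (y + 1.0) * height / 2.0 - 0.5

def homography(src, dst):
    """Solve the 3x3 perspective transform that maps four src points to four dst points."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence gives two linear equations in the eight unknowns.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h8 = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    return np.append(h8, 1.0).reshape(3, 3)   # the last entry is fixed to 1

def apply_homography(H, pt):
    """Apply the perspective transform H to a single 2D point."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return x / w, y / w
```

In such a pipeline, one heatmap per corner would be passed through `dsnt`, the four resulting points would serve as `src`, and the corners of the target rectangle as `dst`. Because the coordinate is an expectation over the heatmap, the output is differentiable and can fall between pixel centres, which is the property that lets DSNT combine the behaviour of heatmap regression and direct coordinate regression.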


HAN Jing, LYU Xueqiang, ZHANG Xiangxiang, HAO Wei, ZHANG Kai. DPENet: Lightweight Document Pose Estimation Network[J]. Computer Engineering and Applications, 2022, 58(22): 210-218.

