Cross-Modal Target Instance Segmentation Method Based on DMN

doi:10.3778/j.issn.1002-8331.2102-0280

Abstract

This paper proposes a cross-modal target instance segmentation method based on DMN, which combines a natural language expression with image information to segment the described object from the image. First, the CBAM attention mechanism is introduced into the visual feature extraction network DPN92 so that the network focuses on useful information along the spatial and channel dimensions. Second, the BN layer is replaced with a joint normalization that combines BN and FRN, which reduces the influence of batch size and channel count on the performance of the feature extraction network and improves the network's generalization ability. Finally, simulation experiments are carried out on three common datasets: ReferIt, GRef and UNC. The results show that the improved model, with the CBAM attention mechanism and joint normalization, raises the mIoU metric by 1.85 and 0.52 percentage points on ReferIt and GRef respectively, and by 1.98, 2.22 and 2.75 percentage points on the three UNC validation sets, indicating that the improved model outperforms existing models in prediction accuracy.

Key words: cross-modal, natural language processing, target instance segmentation, attention mechanism, joint normalization
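
The reported gains are stated in percentage points of mIoU. The abstract does not spell out how the metric is computed, so the following is only a minimal sketch under one common assumption: IoU is computed per image/expression pair on binary masks and then averaged over all pairs. Some referring-segmentation work instead reports cumulative intersection over cumulative union, which can give different numbers. The names `iou` and `mean_iou` are hypothetical and are not taken from the paper.

```python
from typing import Sequence

# A binary mask is a sequence of rows, each row a sequence of 0/1 values.
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of the same size."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average of per-sample IoU (assumed definition of mIoU)."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equal in length")
    return sum(iou(p, t) for p, t in zip(preds, targets)) / len(preds)


# Toy data: the first prediction overlaps half of the union, the second is exact.
preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
print(mean_iou(preds, targets))  # 0.75
```

Under this definition, an improvement of 1.85 percentage points corresponds to an increase of 0.0185 in the value returned by `mean_iou`.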

XIONG Junyao, SONG Zhenfeng, WANG Rong. Cross-Modal Target Instance Segmentation Method Based on DMN[J]. Computer Engineering and Applications, 2022, 58(20): 117-123.
