Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (21): 83-90. DOI: 10.3778/j.issn.1002-8331.2301-0064

• Theory, Research and Development •

Research on Feature Distribution Distillation Algorithm Under Multiple Tasks

GE Haibo, ZHOU Ting, HUANG Chaofeng, LI Qiang   

  1. School of Electronic Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710000, China
  • Online: 2023-11-01    Published: 2023-11-01

Abstract: The rapid performance gains of convolutional neural networks come at the cost of ever deeper architectures and a corresponding multiplication of parameter counts and storage requirements. This not only causes problems such as overfitting during training, but also makes the models hard to run on resource-constrained embedded devices. Model compression techniques have therefore been proposed to address these problems. This article focuses on the feature distillation algorithm within model compression. Because guiding the student network directly with the teacher network's feature maps does not adequately exercise the student network's ability to fit features, a distillation algorithm based on feature distributions is proposed. The algorithm uses the concept of conditional mutual information to construct a probability distribution over the model's feature space, and introduces the maximum mean discrepancy (MMD) to design an MMD loss function that minimizes the distance between the feature distributions of the teacher and student networks. Finally, building on knowledge distillation, a Toeplitz matrix is used to share weights within the student network, further reducing the model's storage footprint. To verify the feature-fitting ability of the student network trained with the feature distribution distillation algorithm, experiments are conducted on three image processing tasks: image classification, object detection and semantic segmentation. The experiments show that the proposed algorithm outperforms the comparison algorithms on all three tasks and achieves distillation between different network architectures.
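To make the MMD objective concrete, the following is a minimal PyTorch-style sketch, not taken from the paper, of a squared-MMD loss with a Gaussian kernel between teacher and student feature distributions. The per-channel flattening, the bandwidth sigma and all function names are illustrative assumptions.

import torch
import torch.nn.functional as F

def feature_distribution(feat):
    # Flatten a conv feature map (B, C, H, W) so that each channel's spatial
    # response is treated as one sample of the feature distribution.
    b, c, h, w = feat.shape
    samples = feat.view(b * c, h * w)
    return F.normalize(samples, dim=1)

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise Gaussian kernel values between the rows of x and y.
    dist2 = torch.cdist(x, y, p=2).pow(2)
    return torch.exp(-dist2 / (2 * sigma ** 2))

def mmd_loss(f_teacher, f_student, sigma=1.0):
    # Squared maximum mean discrepancy between two sample sets.
    k_tt = gaussian_kernel(f_teacher, f_teacher, sigma).mean()
    k_ss = gaussian_kernel(f_student, f_student, sigma).mean()
    k_ts = gaussian_kernel(f_teacher, f_student, sigma).mean()
    return k_tt + k_ss - 2 * k_ts

# Hypothetical usage on feature maps taken from one distillation layer
# (teacher_feat and student_feat may have different channel counts, but are
# assumed to share the same spatial resolution):
#   loss = mmd_loss(feature_distribution(teacher_feat),
#                   feature_distribution(student_feat))

The Toeplitz weight sharing mentioned in the abstract can be sketched in the same spirit. The layer below is a hypothetical illustration, not the paper's implementation, of how an out_features x in_features weight matrix constrained to Toeplitz structure needs only out_features + in_features - 1 free parameters, which is the source of the storage savings.

class ToeplitzLinear(torch.nn.Module):
    # Linear layer whose weight matrix is Toeplitz: entries are shared along
    # each diagonal, so W[i, j] depends only on i - j.
    def __init__(self, in_features, out_features):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.coeffs = torch.nn.Parameter(
            0.01 * torch.randn(out_features + in_features - 1))
        self.bias = torch.nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        i = torch.arange(self.out_features).unsqueeze(1)
        j = torch.arange(self.in_features).unsqueeze(0)
        weight = self.coeffs[i - j + self.in_features - 1]   # shape (out, in)
        return F.linear(x, weight, self.bias)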

Key words: feature distribution distillation, conditional mutual information, feature distribution, maximum mean discrepancy (MMD), Toeplitz matrix
