Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (21): 83-90. DOI: 10.3778/j.issn.1002-8331.2301-0064

• Theory, Research and Development •


Research on Feature Distribution Distillation Algorithm Under Multiple Tasks

GE Haibo, ZHOU Ting, HUANG Chaofeng, LI Qiang   

  1. School of Electronic Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710000, China
  • Online: 2023-11-01  Published: 2023-11-01



Abstract: The rapid performance gains of convolutional neural networks come at the cost of ever deeper layer stacks and multiplying parameter counts and storage requirements. This not only causes problems such as overfitting during training, but also makes such models hard to run on resource-constrained embedded devices. Model compression has been proposed to address these problems; this paper focuses on feature distillation, one family of model compression techniques. Because guiding the student network with the teacher network's raw feature maps does not adequately exercise the student's ability to fit feature distributions, a feature distribution distillation algorithm is proposed. The algorithm uses the concept of conditional mutual information to construct a probability distribution over the model's feature space, and introduces the maximum mean discrepancy (MMD) to design a loss function that minimizes the distance between the teacher's and student's feature distributions. Finally, on top of knowledge distillation, a Toeplitz matrix is used for weight sharing in the student network, which further reduces the model's storage footprint. To verify the feature-fitting ability of a student network trained with the feature distribution distillation algorithm, experiments are conducted on three image processing tasks: image classification, object detection, and semantic segmentation. The results show that the proposed algorithm outperforms the comparison algorithms on all three tasks and achieves distillation across different network architectures.
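To make the MMD objective concrete, below is a minimal PyTorch sketch of an MMD loss between pooled teacher and student features. It assumes a Gaussian kernel, a single matched layer pair, and 2-D (batch, dim) feature tensors; the kernel choice, the bandwidth `sigma`, and the function names are illustrative assumptions, not the authors' implementation.

```python
import torch

def gaussian_kernel(a, b, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)), computed pairwise.
    dist_sq = torch.cdist(a, b) ** 2
    return torch.exp(-dist_sq / (2 * sigma ** 2))

def mmd_loss(teacher_feat, student_feat, sigma=1.0):
    """Squared maximum mean discrepancy between two feature batches.

    Both inputs are (batch, dim) tensors, e.g. globally pooled feature
    maps taken from matched layers of the teacher and student networks.
    Uses the simple biased estimator of MMD^2.
    """
    k_tt = gaussian_kernel(teacher_feat, teacher_feat, sigma)
    k_ss = gaussian_kernel(student_feat, student_feat, sigma)
    k_ts = gaussian_kernel(teacher_feat, student_feat, sigma)
    # MMD^2 = E[k(t, t')] + E[k(s, s')] - 2 * E[k(t, s)]
    return k_tt.mean() + k_ss.mean() - 2 * k_ts.mean()

# Example: distill 128-dim pooled features for a batch of 32 images.
t = torch.randn(32, 128)                       # teacher features
s = torch.randn(32, 128, requires_grad=True)   # student features
loss = mmd_loss(t.detach(), s)                 # teacher is frozen
loss.backward()
```

Minimizing this term pulls the student's feature distribution toward the teacher's in the kernel-induced space, rather than forcing a pointwise match of feature maps.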

Key words: feature distribution distillation, conditional mutual information, feature distribution, maximum mean discrepancy (MMD), Toeplitz matrix
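As an illustration of the Toeplitz weight-sharing idea mentioned in the abstract, the following sketch shows a linear layer whose weight matrix is constrained to be Toeplitz (constant along each diagonal), so it stores only out_features + in_features - 1 parameters instead of out_features × in_features. The class name `ToeplitzLinear` and the choice of layer are hypothetical; the paper's exact sharing scheme is not reproduced here.

```python
import torch
import torch.nn as nn

class ToeplitzLinear(nn.Module):
    """Linear layer with a Toeplitz-constrained weight matrix.

    Entries are constant along each diagonal, so an
    (out_features x in_features) weight needs only
    out_features + in_features - 1 stored parameters.
    """

    def __init__(self, in_features, out_features):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        n_diag = out_features + in_features - 1  # one value per diagonal
        self.coeffs = nn.Parameter(torch.randn(n_diag) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Rebuild the full weight on the fly:
        # weight[i, j] = coeffs[i - j + in_features - 1]
        i = torch.arange(self.out_features).unsqueeze(1)
        j = torch.arange(self.in_features).unsqueeze(0)
        weight = self.coeffs[i - j + self.in_features - 1]
        return x @ weight.t() + self.bias

# A 512 -> 512 layer stores 1023 weights instead of 262144.
layer = ToeplitzLinear(512, 512)
y = layer(torch.randn(8, 512))
```

The full weight matrix is reconstructed from the shared diagonal coefficients at each forward pass, so only the coefficient vector and bias need to be stored, which is where the storage saving claimed in the abstract comes from.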