Computer Engineering and Applications, 2025, Vol. 61, Issue (16): 196-204. DOI: 10.3778/j.issn.1002-8331.2410-0268

• Pattern Recognition and Artificial Intelligence •

Lightweight Design of an Improved Transformer Model for Speech Recognition

WANG Yanhong, ZHAO Liang, WANG Guanjun   

  1. School of Information Engineering, Southwest University of Science and Technology, Mianyang, Sichuan 621010, China
  2. Chengdu Branch, Zhejiang Core Microelectronics Co., Ltd., Chengdu 610041, China
  • Online: 2025-08-15  Published: 2025-08-15

Abstract: As an important application branch of AI, speech recognition has achieved remarkable results in recent years, and Transformer-based speech recognition has developed particularly rapidly. However, the Transformer model's large number of parameters and high computational complexity make it difficult to deploy on edge devices, so designing a lightweight Transformer model for speech recognition deployment is an urgent problem. This paper designs a lightweight Transformer model: it replaces the linear projections of Query, Key and Value with lightweight convolution operations, optimizes the multi-head attention mechanism to improve the attention distribution, and introduces chunked low-rank decomposition into the feedforward network to maximize model compression. Experimental results on the AISHELL-1 and LRS2 datasets show that, under the same conditions, the model size is reduced by 68.03%, the number of parameters by 67.06%, and the word error rate is relatively reduced by 23.19%.
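
Below is a minimal PyTorch sketch of the two compression ideas named in the abstract, for illustration only: depthwise 1-D convolutions standing in for the Query/Key/Value linear projections, and a feedforward block whose expansion matrix is split into chunks that are each factorized into a low-rank pair. All module names, dimensions, kernel size, chunk count and rank below are assumptions made for the sketch, not the paper's published configuration.

import torch
import torch.nn as nn

class LightweightConvAttention(nn.Module):
    # Multi-head self-attention whose Q/K/V projections are depthwise 1-D
    # convolutions: roughly d_model * kernel_size weights each, versus
    # d_model ** 2 for the linear projections they replace.
    def __init__(self, d_model=256, n_heads=4, kernel_size=3):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        def depthwise():
            return nn.Conv1d(d_model, d_model, kernel_size,
                             padding=kernel_size // 2, groups=d_model)
        self.q_conv, self.k_conv, self.v_conv = depthwise(), depthwise(), depthwise()
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (batch, time, d_model)
        b, t, d = x.shape
        xc = x.transpose(1, 2)                     # Conv1d expects (batch, channels, time)
        def heads(z):                              # -> (batch, heads, time, d_head)
            return z.transpose(1, 2).reshape(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = heads(self.q_conv(xc)), heads(self.k_conv(xc)), heads(self.v_conv(xc))
        att = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        y = (att @ v).transpose(1, 2).reshape(b, t, d)
        return self.out(y)

class ChunkedLowRankFFN(nn.Module):
    # Feedforward block whose d_model -> d_ff expansion is split into chunks,
    # each factorized through a small shared rank (d_model -> rank -> chunk
    # instead of a full d_model -> chunk map).
    def __init__(self, d_model=256, d_ff=1024, n_chunks=4, rank=32):
        super().__init__()
        assert d_ff % n_chunks == 0
        chunk = d_ff // n_chunks
        self.up = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, rank, bias=False),
                          nn.Linear(rank, chunk))
            for _ in range(n_chunks))
        self.act = nn.ReLU()
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):                          # x: (batch, time, d_model)
        h = torch.cat([f(x) for f in self.up], dim=-1)
        return self.down(self.act(h))

x = torch.randn(2, 50, 256)                        # (batch, frames, features)
print(LightweightConvAttention()(x).shape)         # torch.Size([2, 50, 256])
print(ChunkedLowRankFFN()(x).shape)                # torch.Size([2, 50, 256])

With the placeholder sizes above, each depthwise projection carries about d_model * kernel_size weights instead of d_model ** 2, and each factorized chunk carries d_model * rank + rank * chunk weights instead of d_model * chunk, which is the kind of saving that parameter reductions like those reported in the abstract would come from.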

Key words: Transformer, speech recognition, lightweight, model compression, deep learning