Computer Engineering and Applications, 2026, Vol. 62, Issue (8): 34-47. DOI: 10.3778/j.issn.1002-8331.2506-0314

• Hot Topics and Reviews •

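  • Funding:
    National Natural Science Foundation of China (62176175, 62372318); Suzhou Water Conservancy and Water Affairs Science and Technology Project (2025004).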

Review of Attention Mechanism in Era of Large Language Models

SHI Denghui1,2, XI Xuefeng1,2,3+, CUI Zhiming1,2,3, ZHU Run4, WANG Jian5   

  1. School of Electronic & Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215000, China
  2. Suzhou Key Laboratory of Virtual Reality Intelligent Interaction and Application Technology, Suzhou, Jiangsu 215000, China
  3. Suzhou Smart City Research Institute, Suzhou University of Science and Technology, Suzhou, Jiangsu 215000, China
  4. Data Bureau of Kunshan City, Kunshan, Jiangsu 215301, China
  5. Public Security Bureau of Kunshan City, Kunshan, Jiangsu 215301, China
  + Corresponding author. E-mail: xfxi@usts.edu.cn
  • Received: 2025-06-26  Revised: 2025-10-28  Online: 2026-04-15  Published: 2026-04-15

Abstract: Since the attention mechanism was proposed, the attention-based Transformer architecture has quickly become the core of large models. It opened a new direction for large language models and has driven substantial progress in natural language processing, computer vision, and many other fields. In recent years, as large models have developed rapidly and their parameter counts have continued to grow, the traditional Transformer architecture has struggled to meet the demands of training at this scale. Beyond adding more computing power, adjusting the model architecture and further exploring the attention mechanism are effective ways to address this challenge and have become an active research topic. This paper first introduces the traditional Transformer architecture and recent research on the Transformer and its variants, and analyzes the principles of its core self-attention mechanism and the bottlenecks it faces. It then analyzes and summarizes recent improvements to the attention module. Taking DeepSeek as an example, it next explores the core technical path behind the model's surge in popularity: a Transformer-based mixture-of-experts (MoE) architecture and multi-head latent attention (MLA). Finally, it summarizes the current state of research on optimizing the attention mechanism and outlines future research directions.

Key words: large language model (LLM), attention mechanism, multi-head latent attention mechanism (MLA), MoE architecture