
Computer Engineering and Applications, 2025, Vol. 61, Issue (22): 20-35. DOI: 10.3778/j.issn.1002-8331.2502-0161
WEN Jialin, LI Xiaojun, YAO Junping, GU Hongyang
Online:2025-11-15
Published:2025-11-14
Abstract: Large language models have achieved remarkable results in natural language processing and related fields in recent years, and mixture-of-experts (MoE) models reduce their computational demands through sparse-activation strategies. As the inference tasks handled by MoE models grow more complex, expert models deployed on terminal devices frequently face resource demands that exceed the computational power of the node, so computational optimization of MoE models under computational power constraints has become a persistent research hotspot in the field. This paper introduces the concept and architecture of MoE models and presents a categorized survey of optimization methods along three dimensions: the gating network, expert structure and models, and memory management. At the gating-network level, routing design, loss-function optimization, and load-balancing mechanisms are examined as the means of achieving accurate routing; at the expert-structure level, structural innovations in expert design, preprocessing methods, and expert-merging strategies are summarized; at the memory-management level, existing parameter-compression and memory-offloading techniques are reviewed to address the resource constraints models face at deployment. Finally, the principles, strategies, and main technical challenges of computational optimization in each dimension are analyzed, and key open problems and potential research opportunities for the field are identified.
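To make the gating-network dimension concrete, the sketch below illustrates top-k sparse gating with an auxiliary load-balancing loss in the style popularized by Switch Transformers and GShard. It is a minimal illustration rather than code from any system surveyed in the paper; the class name TopKGate and all hyperparameter values are assumptions introduced here for exposition.

# Minimal sketch (illustrative, not from the surveyed paper): top-k sparse gating
# with a Switch-Transformer-style auxiliary load-balancing loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.num_experts = num_experts
        self.k = k
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)  # router weights

    def forward(self, x: torch.Tensor):
        # x: [num_tokens, d_model]
        logits = self.w_gate(x)                                    # [tokens, experts]
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)          # pick k experts per token
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)  # renormalize gate weights

        # Auxiliary load-balancing loss: with f_i the fraction of tokens whose top-1
        # expert is i and P_i the mean gate probability of expert i, minimizing
        # N * sum_i f_i * P_i pushes the router toward a uniform token-to-expert load.
        dispatch = F.one_hot(topk_idx[..., 0], self.num_experts).float()
        tokens_per_expert = dispatch.mean(dim=0)
        mean_gate_prob = probs.mean(dim=0)
        aux_loss = self.num_experts * torch.sum(tokens_per_expert * mean_gate_prob)
        return topk_idx, topk_probs, aux_loss

# Hypothetical usage: route a batch of 16 tokens with hidden size 512 to 2 of 8 experts.
gate = TopKGate(d_model=512, num_experts=8, k=2)
idx, weights, aux = gate(torch.randn(16, 512))

In practice the auxiliary loss is added to the task loss with a small coefficient, which is one instance of the loss-function optimization and load-balancing mechanisms discussed in the survey.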
WEN Jialin, LI Xiaojun, YAO Junping, GU Hongyang. Algorithm Optimization Method for Mixture of Experts Under Computational Power Constraints: Status and Progress[J]. Computer Engineering and Applications, 2025, 61(22): 20-35.