Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (21): 144-156. DOI: 10.3778/j.issn.1002-8331.2503-0235

• Theory, Research and Development •

Device-Cloud Collaborative Offline-to-Online Reinforcement Learning Method and Its Application in Wargame

SHI Wei (施伟), HUANG Honglan (黄红蓝), LIANG Xingxing (梁星星), CHENG Guangquan (程光权), ZHENG Zhenzhe (郑臻哲)

  1. Laboratory for Big Data and Decision, College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
    2. College of Nuclear Engineering, Rocket Force University of Engineering, Xi’an 710025, China
    3. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
  • Online: 2025-11-01    Published: 2025-10-31

Abstract: With the evolution of military intelligence technology, research on intelligent decision-making in wargames has gained significant attention. To address the communication latency, data security risks, and deployment barriers of cloud-centric decision-making paradigms, Decider, a device-cloud collaborative hybrid offline-to-online reinforcement learning framework, is proposed. Decider integrates offline pre-training with online learning, enabling decisions driven by the fusion of prior knowledge and trial-and-error data. The cloud server dynamically screens high-value samples and transmits them to edge devices, mitigating data distribution shift and accelerating policy search. A historical momentum-based model aggregation method is introduced to stabilize training. Experimental results in a naval-air combat wargame scenario demonstrate that Decider accelerates policy search by more than 90%, improves the average adversarial score by more than 26%, and reduces the communication load to 9.5% of that of a conventional cloud-centric architecture.

Key words: device-cloud collaboration, reinforcement learning, offline-to-online reinforcement learning, wargame, intelligent decision-making
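
Note: The abstract names two concrete mechanisms, cloud-side screening of high-value samples and historical momentum-based model aggregation, but no code is reproduced on this page. The Python sketch below is only one plausible reading of those two ideas; the function names, the TD-error screening criterion, the top_ratio threshold, the discount factor, and the beta momentum coefficient are all assumptions made for illustration, not the authors' implementation.

# Illustrative sketch only: every name, threshold, and criterion below is a
# hypothetical reconstruction of the two mechanisms named in the abstract
# (cloud-side high-value sample screening and historical-momentum model
# aggregation), not the published Decider implementation.
import numpy as np


def screen_high_value_samples(transitions, value_fn, top_ratio=0.2):
    """Cloud side: keep only the transitions with the largest absolute TD error.

    `transitions` is a list of (state, action, reward, next_state) tuples and
    `value_fn` maps a state to a scalar value estimate; `top_ratio` (assumed)
    controls how small the transmitted subset is, which is what would reduce
    the device-cloud communication load.
    """
    gamma = 0.99  # assumed discount factor
    td_errors = [
        abs(r + gamma * value_fn(s_next) - value_fn(s))
        for (s, a, r, s_next) in transitions
    ]
    k = max(1, int(len(transitions) * top_ratio))
    top_idx = np.argsort(td_errors)[-k:]  # indices of the k largest TD errors
    return [transitions[i] for i in top_idx]


def momentum_aggregate(prev_params, new_params, beta=0.9):
    """Historical-momentum aggregation: blend newly trained parameters with the
    running historical parameters to damp oscillations during the
    offline-to-online transition. `beta` is an assumed momentum coefficient."""
    return {
        name: beta * prev_params[name] + (1.0 - beta) * new_params[name]
        for name in prev_params
    }


if __name__ == "__main__":
    # Toy usage with scalar "states" so the sketch runs end to end.
    rng = np.random.default_rng(0)
    data = [(rng.normal(), 0, rng.normal(), rng.normal()) for _ in range(100)]
    selected = screen_high_value_samples(data, value_fn=lambda s: 0.5 * s)
    print(f"transmitting {len(selected)} of {len(data)} transitions")

    old_w = {"layer0": np.zeros(4)}
    new_w = {"layer0": np.ones(4)}
    print(momentum_aggregate(old_w, new_w))

In this reading, transmitting only the top fraction of transitions is what would account for the reported drop in communication load, while the momentum blend damps parameter swings once online trial-and-error data begins to dominate the offline prior.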