计算机工程与应用 ›› 2025, Vol. 61 ›› Issue (17): 329-336.DOI: 10.3778/j.issn.1002-8331.2407-0069

• 工程与应用 • 上一篇    下一篇

求解在线三维装箱问题的启发式深度强化学习算法

张长勇,姚凯超,张宇浩   

  1. 中国民航大学 电子信息与自动化学院,天津 300300
  • 出版日期:2025-09-01 发布日期:2025-09-01

Heuristic Deep Reinforcement Learning Algorithm for Solving Online 3D Bin Packing Problem

ZHANG Changyong, YAO Kaichao, ZHANG Yuhao   

  1. College of Electronic Information and Automation, Civil Aviation University of China, Tianjin 300300, China
  • Online:2025-09-01 Published:2025-09-01

摘要: 货物装载是物流运输过程中的关键一环,属于NP-Hard问题。为解决智慧物流领域货物“即到即码”的实时性问题,提出了一种候选启发式与深度强化学习相结合的在线三维装箱算法。将在线三维装箱表述为带约束的马尔科夫决策过程,并考虑七种实际约束条件,在此基础上设计强化学习要素。设置货物码垛的候选缓存区,根据人工启发式生成有价值的先验知识,以此来初始化深度强化学习算法的训练过程,最终经过对决网络评估后输出最优动作。实验结果表明,算法空间利用率为85.3%,收敛速度提高25%,决策时间平均快15 ms,有效解决了面对大规模动作空间增长导致的智能体初期探索困难的问题,提高了算法的效率和实用性,更适用于实际在线装箱场景。

关键词: NP-Hard问题, 在线三维装箱, 候选启发式, 深度强化学习, 马尔可夫决策

Abstract: Cargo loading is a key part of the logistics transportation process, which belongs to the NP-Hard problem. In order to solve the real-time problem of “real-time palletizing” in the field of intelligent logistics, a candidate heuristic deep reinforcement learning algorithm for online 3D case loading is proposed. The palletizing process of online 3D bin packing is expressed as a Markov decision process, and the reinforcement learning elements are designed to satisfy seven practical constraints. A candidate cache is set on the basis of the deep reinforcement learning algorithm. Candidate solutions are generated according to the heuristic experience, from which the algorithm filters the optimal solutions for training, and finally outputs the optimal action after evaluation by the dueling network. Experimental results show that the algorithm has a space utilization of 85.3%, a 25% improvement in convergence speed, and an average of 15 ms faster decision time, which effectively solves the problem of initial exploration difficulty of the intelligent body caused by facing the large-scale growth of the action space, and improves the efficiency and practicability of the algorithm, which is more suitable for the actual on-line crating scenarios.

Key words: NP-Hard problem, online 3D bin packing, candidate heuristic, deep reinforcement learning, Markov decision