Computer Engineering and Applications ›› 2019, Vol. 55 ›› Issue (10): 264-270. DOI: 10.3778/j.issn.1002-8331.1806-0324

• Engineering and Applications •

Application of an Improved DDPG Algorithm to Autonomous Driving

ZHANG Bin1, HE Ming1,2, CHEN Xiliang1, WU Chunxiao1, LIU Bin1, ZHOU Bo1

  1. College of Command and Control Engineering, The Army Engineering University of PLA, Nanjing 210002, China
    2. Institute of Network Information, Academy of Systems Engineering, Academy of Military Sciences, Beijing 100071, China
  • Online: 2019-05-15  Published: 2019-05-13

Self-Driving Via Improved DDPG Algorithm

ZHANG Bin1, HE Ming1,2, CHEN Xiliang1, WU Chunxiao1, LIU Bin1, ZHOU Bo1   

  1. College of Command and Control Engineering, The Army Engineering University of PLA, Nanjing 210002, China
    2. Institute of Network Information, Academy of Systems Engineering, Academy of Military Sciences, Beijing 100071, China
  • Online: 2019-05-15  Published: 2019-05-13

Abstract: As a classic deep reinforcement learning algorithm, the Deep Deterministic Policy Gradient (DDPG) algorithm has a clear advantage on continuous control problems and has been applied to autonomous driving. Because DDPG lacks filtering of policy actions, a high proportion of illegal policies causes low training efficiency and slow convergence. To address this, a DDPG algorithm based on failure experience correction is proposed. The experience buffer is split into separate pools, failure data are selected for training according to driving performance, the single output of the policy network is converted into throttle and brake control quantities, and the exploration policy is improved with normally distributed noise. Simulation experiments on the TORCS platform show that, compared with the DDPG and DQN (Deep Q-learning Network) algorithms, the proposed algorithm markedly improves training efficiency and reduces illegal driving policies to zero.
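The abstract only names the separated-experience-buffer mechanism, so a minimal Python sketch of that idea follows; the class name SplitReplayBuffer, the capacity and failure_ratio parameters, and the failure criterion are illustrative assumptions, not details taken from the paper.

```python
import random
from collections import deque

class SplitReplayBuffer:
    """Hypothetical sketch of separated experience pools: transitions are
    routed to a success or a failure buffer based on driving performance,
    and failure data can be sampled preferentially during training."""

    def __init__(self, capacity=100_000, failure_ratio=0.5):
        self.success = deque(maxlen=capacity)
        self.failure = deque(maxlen=capacity)
        self.failure_ratio = failure_ratio  # fraction of each batch drawn from failure data

    def add(self, transition, failed):
        # `failed` flags poor driving performance, e.g. running off track (assumed criterion)
        (self.failure if failed else self.success).append(transition)

    def sample(self, batch_size):
        # Draw a mixed batch, biased toward failure experience
        n_fail = min(int(batch_size * self.failure_ratio), len(self.failure))
        n_succ = min(batch_size - n_fail, len(self.success))
        batch = random.sample(self.failure, n_fail) + random.sample(self.success, n_succ)
        random.shuffle(batch)
        return batch
```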

Keywords: deep reinforcement learning, autonomous driving, DDPG algorithm, experience buffer separation, TORCS

Abstract: As a classic algorithm in deep reinforcement learning, the Deep Deterministic Policy Gradient (DDPG) algorithm has a great advantage in continuous control problems and has been applied to self-driving. To solve the problems of low training efficiency and a large proportion of illegal driving policies, an improved algorithm called failure experience correction DDPG is proposed. The algorithm divides the experience pool into a success experience pool and a failure experience pool, selects failure experience according to driving performance, controls the brake pedal and the acceleration pedal via a single neural network output, and explores unknown policies through normally distributed noise. Simulation results on the TORCS platform show that, compared with the DDPG and DQN algorithms, the proposed algorithm significantly improves training efficiency and reduces illegal driving policies to zero.
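The single-output pedal control and normal-distribution exploration noise mentioned above can be illustrated with the short sketch below; the function name, the noise_std value, and the [-1, 1] output range are assumptions for illustration rather than the paper's exact formulation.

```python
import numpy as np

def split_throttle_brake(u, noise_std=0.1):
    """Hypothetical sketch: one actor output u in [-1, 1] controls both pedals.
    Positive values map to throttle and negative values map to brake, so the
    two pedals are never pressed at the same time. Gaussian (normal) noise is
    added for exploration, as described in the abstract."""
    u = np.clip(u + np.random.normal(0.0, noise_std), -1.0, 1.0)
    throttle = max(u, 0.0)   # accelerate when the output is positive
    brake = max(-u, 0.0)     # brake when the output is negative
    return throttle, brake

# Example: a raw actor output of 0.3 yields some throttle and zero brake.
print(split_throttle_brake(0.3))
```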

Key words: deep reinforcement learning, self-driving, DDPG algorithm, experience pool division, TORCS