Computer Engineering and Applications ›› 2019, Vol. 55 ›› Issue (22): 119-126. DOI: 10.3778/j.issn.1002-8331.1904-0238

• Pattern Recognition and Artificial Intelligence •

Maximum Entropy Inverse Reinforcement Learning Based on Generative Adversarial Networks

CHEN Jianping, CHEN Qiqiang, FU Qiming, GAO Zhen, WU Hongjie, LU You   

  1. College of Electronics and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
  2. Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
  Online: 2019-11-15  Published: 2019-11-13

Abstract: Inverse reinforcement learning converges slowly in the early stage of training when expert samples are sparse. To address this problem, a maximum entropy inverse reinforcement learning algorithm based on Generative Adversarial Networks (GAN) is proposed. During learning, a GAN is trained and optimized on the expert samples to generate virtual expert samples; non-expert samples are then generated with a stochastic policy, and the two are combined into a mixed sample set. The reward function is modeled with the maximum entropy probability model, and the optimal reward function is solved by gradient descent. Under the resulting reward function, forward reinforcement learning is used to solve for the optimal policy, which is then used to generate further non-expert samples; the mixed sample set is reconstructed and the optimal reward function is solved iteratively. The proposed algorithm and the MaxEnt IRL algorithm are applied to the classic Object World and Mountain Car problems. Experiments show that, when expert samples are sparse, the proposed algorithm recovers the reward function well and exhibits better convergence performance.
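As a rough illustration of the training loop the abstract describes, below is a minimal, self-contained NumPy sketch on a toy chain-world MDP. It is not the paper's implementation: all names (sample_trajectories, forward_rl, virtual_expert_samples, the chain MDP itself) are assumptions made for this sketch, and the GAN that produces virtual expert samples is abstracted as simple bootstrap resampling of the expert trajectories. A faithful implementation would train a generator/discriminator pair on the expert data and could use a different forward RL method.

```python
# Minimal sketch of GAN-augmented MaxEnt IRL on a toy MDP (assumptions only;
# not the paper's code). Requires NumPy.
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP: a 1-D chain of N states; action 0 moves left, action 1 moves right.
N, A, GAMMA, T_LEN = 8, 2, 0.95, 12
FEATS = np.eye(N)                          # one-hot state features phi(s)

def step(s, a):
    return max(0, s - 1) if a == 0 else min(N - 1, s + 1)

def sample_trajectories(policy, n):
    """Roll out n trajectories under a stochastic policy (N x A, row-stochastic)."""
    trajs = []
    for _ in range(n):
        s, traj = int(rng.integers(N)), []
        for _ in range(T_LEN):
            traj.append(s)
            s = step(s, rng.choice(A, p=policy[s]))
        trajs.append(traj)
    return trajs

def feature_expectations(trajs):
    """Discounted empirical feature expectations of a trajectory set."""
    mu = np.zeros(N)
    for traj in trajs:
        for t, s in enumerate(traj):
            mu += GAMMA ** t * FEATS[s]
    return mu / len(trajs)

def forward_rl(theta):
    """Forward step: soft value iteration under the linear reward r(s) = theta[s]."""
    v = np.zeros(N)
    for _ in range(200):
        q = np.array([[theta[s] + GAMMA * v[step(s, a)] for a in range(A)]
                      for s in range(N)])
        v = np.logaddexp.reduce(q, axis=1)  # soft (maximum-entropy) backup
    p = np.exp(q - q.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)  # stochastic max-ent policy

def virtual_expert_samples(trajs, n):
    """Stand-in for the trained GAN generator: bootstrap-resample expert data."""
    return [trajs[i] for i in rng.integers(len(trajs), size=n)]

# Sparse expert data: an expert that almost always moves right.
expert_policy = np.tile([0.05, 0.95], (N, 1))
expert_trajs = sample_trajectories(expert_policy, 3)   # deliberately few

theta = np.zeros(N)                        # weights of the linear reward model
policy = np.full((N, A), 1.0 / A)          # initial stochastic (random) policy
for _ in range(50):
    # Mixed sample set: real + virtual expert samples, plus non-expert samples.
    experts = expert_trajs + virtual_expert_samples(expert_trajs, 10)
    non_experts = sample_trajectories(policy, 10)
    # MaxEnt IRL gradient: expert feature expectations minus the learner's.
    theta += 0.05 * (feature_expectations(experts)
                     - feature_expectations(non_experts))
    policy = forward_rl(theta)             # re-solve the policy, then iterate

print("learned reward per state:", np.round(theta, 2))
```

The gradient step follows the standard MaxEnt IRL result for a reward linear in state features: the log-likelihood gradient is the expert feature expectations minus those induced by the current policy, so each outer iteration nudges the reward toward states the (real and virtual) experts visit and away from states the learner visits.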

Key words: Generative Adversarial Networks (GAN), inverse reinforcement learning, maximum entropy