Maximum Entropy Inverse Reinforcement Learning Based on Generative Adversarial Networks

doi:10.3778/j.issn.1002-8331.1904-0238

Computer Engineering and Applications, 2019, Vol. 55, Issue (22): 119-126. DOI: 10.3778/j.issn.1002-8331.1904-0238

CHEN Jianping, CHEN Qiqiang, FU Qiming, GAO Zhen, WU Hongjie, LU You

1. College of Electronics and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
2. Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China

Online: 2019-11-15 Published: 2019-11-13

Abstract

Inverse reinforcement learning algorithms learn slowly in the early stage of training because expert samples are sparse. To address this problem, a maximum entropy inverse reinforcement learning algorithm based on Generative Adversarial Networks (GAN) is proposed. During learning, a GAN is trained and optimized on the expert samples to generate virtual expert samples. Non-expert samples are then generated with a stochastic policy, and both kinds of samples are combined into a mixed sample set. The reward function is modeled with the maximum entropy probability model, and the optimal reward function is solved by gradient descent. Based on the resulting reward function, a forward reinforcement learning method is used to solve for the optimal policy, which is then used to generate further non-expert samples. The mixed sample set is rebuilt, and the optimal reward function is solved iteratively. The proposed algorithm and the MaxEnt IRL algorithm are applied to the classic Object World and Mountain Car problems. Experiments show that, when expert samples are sparse, the proposed algorithm solves for the reward function fairly well and has good convergence performance.

Key words: Generative Adversarial Networks (GAN), inverse reinforcement learning, maximum entropy
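The reward-fitting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions that the abstract does not state: the reward is linear in trajectory features, the maximum entropy model is P(τ) ∝ exp(θ·f(τ)), and the partition function is approximated by a sum over the mixed sample set. The GAN that would supply virtual expert samples and the forward reinforcement learning step are omitted; the function names `maxent_reward_step` and `fit_reward` and the feature arrays are hypothetical.

```python
import numpy as np


def maxent_reward_step(theta, expert_feats, mixed_feats, lr=0.1):
    """One gradient-descent step on the negative log-likelihood of the
    expert trajectories under P(tau) proportional to exp(theta . f(tau)).

    Assumption: the partition function is approximated by summing over
    the trajectories in the mixed (expert + non-expert) sample set.
    """
    scores = mixed_feats @ theta
    scores = scores - scores.max()          # shift for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()                    # P(tau) over the mixed set
    model_expectation = probs @ mixed_feats
    expert_expectation = expert_feats.mean(axis=0)
    grad = model_expectation - expert_expectation  # gradient of the NLL
    return theta - lr * grad


def fit_reward(expert_feats, non_expert_feats, steps=500, lr=0.1):
    """Fit linear reward weights on a mixed sample set (hypothetical helper)."""
    mixed = np.vstack([expert_feats, non_expert_feats])
    theta = np.zeros(mixed.shape[1])
    for _ in range(steps):
        theta = maxent_reward_step(theta, expert_feats, mixed, lr)
    return theta


if __name__ == "__main__":
    # Toy trajectory features: (virtual) expert samples favor feature 0,
    # samples from a random policy favor feature 1.
    expert = np.array([[1.0, 0.0], [0.9, 0.1]])
    non_expert = np.array([[0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
    print(fit_reward(expert, non_expert))
```

In the iterative scheme the abstract outlines, the fitted weights would define a reward for a forward reinforcement learning solver, whose policy would produce new non-expert samples for the next call to `fit_reward`.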

CHEN Jianping, CHEN Qiqiang, FU Qiming, GAO Zhen, WU Hongjie, LU You. Maximum Entropy Inverse Reinforcement Learning Based on Generative Adversarial Networks[J]. Computer Engineering and Applications, 2019, 55(22): 119-126.
