Computer Engineering and Applications ›› 2009, Vol. 45 ›› Issue (7): 20-23. DOI: 10.3778/j.issn.1002-8331.2009.07.007

• Doctoral Forum •

Policy gradient algorithm based on internal structural MPOMDP model
(基于内部结构MPOMDP模型的策略梯度学习算法)

ZHANG Run-mei (张润梅)1,2, WANG Hao (王浩)1, ZHANG You-sheng (张佑生)1, YAO Hong-liang (姚宏亮)1, FANG Chang-sheng (方长胜)1

  1. Department of Computer Science and Technology, Hefei University of Technology, Hefei 230009, China
  2. School of Electronics and Information Engineering, Anhui University of Architecture, Hefei 230022, China
  • Received: 2008-10-22 Revised: 2008-12-04 Online: 2009-03-01 Published: 2009-03-01
  • Contact: ZHANG Run-mei

Abstract: To improve the knowledge representation ability and reasoning efficiency of the MPOMDP model, an MPOMDP model based on the internal structure of agents is proposed. The model represents each agent's internal structure and its evolution over time, and improves reasoning efficiency by expressing the joint probability distribution of the system in a locally factored form over each agent's set of internal variables. The GPI-POMDP algorithm is extended to the internal-structure MPOMDP model, giving a multi-agent policy gradient algorithm based on internal states (MIS-GPOMDP) for solving the internal-structure MPOMDP. Experimental results show that the MIS-GPOMDP algorithm achieves high reasoning efficiency and that it converges.

Key words: Markov decision processes (MDP), reinforcement learning, MPOMDP model, policy gradient algorithm