Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (6): 162-170.DOI: 10.3778/j.issn.1002-8331.2110-0205

• Pattern Recognition and Artificial Intelligence •

Double Deep Q-Network by Fusing Contrastive Predictive Coding

LIU Jianfeng, PU Jiexin, SUN Lifan   

  1. School of Information Engineering, Henan University of Science and Technology, Luoyang, Henan 471023, China
  • Online: 2023-03-15  Published: 2023-03-15

Abstract: In a partially observable Markov decision process (POMDP) with an unknown model, the agent cannot directly access the true state of the environment, and perceptual uncertainty makes learning the optimal policy challenging. A double deep Q-network reinforcement learning algorithm based on contrastive predictive coding representations is therefore proposed: belief states are modeled explicitly to obtain a compact, efficient history encoding for policy optimization. To improve data efficiency, a belief replay buffer is introduced that stores belief transition pairs directly, rather than observation and action sequences, reducing memory usage. In addition, a phased training strategy decouples representation learning from policy learning to improve training stability. POMDP navigation tasks are designed in the Gym-MiniGrid environment. Experimental results show that the proposed algorithm captures state-related semantic information, enabling stable and efficient policy learning in POMDPs.
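
To make the abstract's components concrete, the following is a minimal PyTorch sketch of how they might fit together; the names CPCEncoder, infonce_loss, BeliefReplayBuffer, and ddqn_target are hypothetical illustrations under assumed architectures (a GRU history encoder, an InfoNCE contrastive objective), not the paper's actual code. A recurrent encoder trained with the contrastive predictive coding loss produces belief representations, transitions over those beliefs are stored directly in a replay buffer, and double-DQN targets are computed on the beliefs.

    import collections
    import random
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CPCEncoder(nn.Module):
        """Encodes an observation-action history into a belief representation.
        The architecture (linear embedding + GRU) is an assumption for this
        sketch, not the paper's exact network."""
        def __init__(self, obs_dim, act_dim, hid_dim=128):
            super().__init__()
            self.embed = nn.Linear(obs_dim + act_dim, hid_dim)
            self.gru = nn.GRU(hid_dim, hid_dim, batch_first=True)

        def forward(self, obs_act_seq):
            # obs_act_seq: (batch, time, obs_dim + act_dim)
            z = F.relu(self.embed(obs_act_seq))
            beliefs, _ = self.gru(z)      # (batch, time, hid_dim)
            return beliefs

    def infonce_loss(context, future_z, W):
        """CPC's InfoNCE objective: score each context c_t against its true
        future latent z_{t+k} (positive) and the other futures in the batch
        (negatives). W is a learned (hid_dim, hid_dim) prediction matrix."""
        pred = context @ W                # (batch, hid_dim) predicted future
        logits = pred @ future_z.t()      # (batch, batch) similarity scores
        labels = torch.arange(logits.size(0), device=logits.device)
        return F.cross_entropy(logits, labels)

    class BeliefReplayBuffer:
        """Stores (belief, action, reward, next_belief, done) tuples directly,
        instead of raw observation-action sequences, to cut memory usage."""
        def __init__(self, capacity=100_000):
            self.buffer = collections.deque(maxlen=capacity)

        def push(self, transition):
            self.buffer.append(transition)

        def sample(self, batch_size):
            return random.sample(self.buffer, batch_size)

    def ddqn_target(q_net, target_net, belief_next, reward, done, gamma=0.99):
        """Double DQN target on belief representations: the online network
        selects the next action, the target network evaluates it."""
        with torch.no_grad():
            a_star = q_net(belief_next).argmax(dim=1, keepdim=True)
            q_next = target_net(belief_next).gather(1, a_star).squeeze(1)
            return reward + gamma * (1.0 - done) * q_next

Consistent with the abstract's phased training strategy, one would first train CPCEncoder with infonce_loss, then freeze or slow its updates while the Q-networks are trained on samples drawn from BeliefReplayBuffer.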

Key words: partially observable Markov decision process(POMDP), representation learning, reinforcement learning, contrastive predictive coding, double deep Q-network
