Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (5): 148-154. DOI: 10.3778/j.issn.1002-8331.2109-0475

• Pattern Recognition and Artificial Intelligence •

Intrinsic Reward Method Combining Novelty and Risk Assessment

ZHAO Ying, QIN Jin, YUAN Linlin   

  1.College of Computer Science & Technology, Guizhou University, Guiyang 550025, China
  2.School of Information Engineering, Guizhou Open University, Guiyang 550023, China
  • Online: 2023-03-01  Published: 2023-03-01

Abstract: Reinforcement learning algorithms rely on well-designed extrinsic rewards. However, when an agent interacts with the environment, the extrinsic rewards the environment feeds back are often sparse or delayed, which prevents the agent from learning a good policy. To address this problem, an intrinsic reward is designed from two aspects, novelty and risk assessment, so that the agent can explore the environment thoroughly while accounting for the uncertainty of actions in it. The method consists of two parts. First, novelty is measured by the number of visits to the current state-action pair and to the post-transition state, so that the specific action performed is taken into account. Second, the risk degree of an action is assessed from the variance of the cumulative reward, which is used to judge whether the current action is risky or risk-free for the state. The method is evaluated in the MuJoCo environment, and the experiments verify that it achieves a higher average reward, and that it still obtains a good average reward when the extrinsic reward is delayed. This shows that the method can effectively mitigate the problem of sparse extrinsic rewards.
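
To make the two parts of the method concrete, the following is a minimal Python sketch of an intrinsic reward that combines a count-based novelty bonus with a variance-based risk penalty. It is an illustration under stated assumptions, not the paper's exact formulation: the class name NoveltyRiskReward, the coefficients beta and lam, and the rounding-based discretization of continuous MuJoCo observations are all hypothetical choices.

import numpy as np
from collections import defaultdict

class NoveltyRiskReward:
    """Sketch: intrinsic reward = novelty bonus - risk penalty.
    Names, weights, and the discretization scheme are assumptions."""

    def __init__(self, beta=0.1, lam=0.05, decimals=1):
        self.beta = beta          # weight of the novelty term
        self.lam = lam            # weight of the risk term
        self.decimals = decimals  # rounding used to count continuous states
        self.sa_visits = defaultdict(int)    # visit counts N(s, a)
        self.s_visits = defaultdict(int)     # visit counts N(s') of successors
        self.sa_returns = defaultdict(list)  # cumulative rewards seen at (s, a)

    def _discretize(self, x):
        # Round continuous observations/actions so they can be counted.
        return tuple(np.round(np.asarray(x, dtype=float), self.decimals))

    def intrinsic(self, state, action, next_state, cumulative_reward):
        sa = (self._discretize(state), self._discretize(action))
        s_next = self._discretize(next_state)

        self.sa_visits[sa] += 1
        self.s_visits[s_next] += 1
        self.sa_returns[sa].append(cumulative_reward)

        # Novelty decays with visits to (s, a) and to the successor state s',
        # so the specific action performed influences the bonus.
        novelty = 1.0 / np.sqrt(self.sa_visits[sa] * self.s_visits[s_next])

        # Risk is the variance of cumulative rewards observed for (s, a);
        # high variance marks the action as risky and lowers the reward.
        returns = self.sa_returns[sa]
        risk = float(np.var(returns)) if len(returns) > 1 else 0.0

        return self.beta * novelty - self.lam * risk

In a training loop the shaped reward would then be computed per step as r_total = r_ext + shaper.intrinsic(s, a, s_next, episode_return), so that exploration is still driven by the novelty term when r_ext is zero or delayed for long stretches.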

Key words: reinforcement learning, novelty, risk assessment, intrinsic reward
