Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (5): 148-154. DOI: 10.3778/j.issn.1002-8331.2109-0475

• Pattern Recognition and Artificial Intelligence •

Intrinsic Reward Method Combining Novelty and Risk Assessment

ZHAO Ying, QIN Jin, YUAN Linlin   

  1.College of Computer Science & Technology, Guizhou University, Guiyang 550025, China
  2.School of Information Engineering, Guizhou Open University, Guiyang 550023, China
  • Online: 2023-03-01  Published: 2023-03-01

Abstract: Reinforcement learning algorithms rely on well-designed extrinsic rewards. However, when an agent interacts with the environment, the extrinsic rewards the environment feeds back are often sparse or delayed, which prevents the agent from learning a good policy. To address this problem, an intrinsic reward is designed from two aspects, novelty and risk assessment, so that the agent can explore the environment thoroughly while accounting for the uncertainty of actions in it. The method consists of two parts. First, novelty is measured by the number of visits to the current state-action pair and to the post-transition state, so that the specific action performed is taken into account. Second, the risk degree of an action is assessed from the variance of the cumulative reward, which is used to judge whether the current action is risky or risk-free for the state. The method is evaluated in the MuJoCo environment, and the experiments verify that it achieves a higher average reward, and that it still obtains a good average reward when the extrinsic reward is delayed. This shows that the method can effectively mitigate the problem of sparse extrinsic rewards.
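
To make the two parts of the method concrete, the following is a minimal Python sketch of an intrinsic reward that combines a count-based novelty bonus with a variance-based risk penalty. It is an illustration under stated assumptions, not the paper's exact formulation: the class name NoveltyRiskReward, the coefficients beta and lam, and the rounding-based discretization of continuous MuJoCo observations are all hypothetical choices.

import numpy as np
from collections import defaultdict

class NoveltyRiskReward:
    """Sketch: intrinsic reward = novelty bonus - risk penalty.
    Names, weights, and the discretization scheme are assumptions."""

    def __init__(self, beta=0.1, lam=0.05, decimals=1):
        self.beta = beta          # weight of the novelty term
        self.lam = lam            # weight of the risk term
        self.decimals = decimals  # rounding used to count continuous states
        self.sa_visits = defaultdict(int)    # visit counts N(s, a)
        self.s_visits = defaultdict(int)     # visit counts N(s') of successors
        self.sa_returns = defaultdict(list)  # cumulative rewards seen at (s, a)

    def _discretize(self, x):
        # Round continuous observations/actions so they can be counted.
        return tuple(np.round(np.asarray(x, dtype=float), self.decimals))

    def intrinsic(self, state, action, next_state, cumulative_reward):
        sa = (self._discretize(state), self._discretize(action))
        s_next = self._discretize(next_state)

        self.sa_visits[sa] += 1
        self.s_visits[s_next] += 1
        self.sa_returns[sa].append(cumulative_reward)

        # Novelty decays with visits to (s, a) and to the successor state s',
        # so the specific action performed influences the bonus.
        novelty = 1.0 / np.sqrt(self.sa_visits[sa] * self.s_visits[s_next])

        # Risk is the variance of cumulative rewards observed for (s, a);
        # high variance marks the action as risky and lowers the reward.
        returns = self.sa_returns[sa]
        risk = float(np.var(returns)) if len(returns) > 1 else 0.0

        return self.beta * novelty - self.lam * risk

In a training loop the shaped reward would then be computed per step as r_total = r_ext + shaper.intrinsic(s, a, s_next, episode_return), so that exploration is still driven by the novelty term when r_ext is zero or delayed for long stretches.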

Key words: reinforcement learning, novelty, risk assessment, intrinsic reward
