[1] KOREN Y, BELL R, VOLINSKY C. Matrix factorization techniques for recommender systems[J]. Computer, 2009, 42(8): 30-37.
[2] HE X N, LIAO L Z, ZHANG H W, et al. Neural collaborative filtering[C]//Proceedings of the 26th International Conference on World Wide Web. New York: ACM, 2017: 173-182.
[3] COVINGTON P, ADAMS J, SARGIN E, et al. Deep neural networks for YouTube recommendations[C]//Proceedings of the 10th ACM Conference on Recommender Systems. New York: ACM, 2016: 191-198.
[4] FORTUNATO M, AZAR M G, PIOT B, et al. Noisy networks for exploration[J]. arXiv:1706.10295, 2017.
[5] SHANI G, HECKERMAN D, BRAFMAN R I. An MDP-based recommender system[J]. Journal of Machine Learning Research, 2005, 6: 1265-1295.
[6] ZHENG G J, ZHANG F Z, ZHENG Z H, et al. DRN: a deep reinforcement learning framework for news recommendation[C]//Proceedings of the 2018 World Wide Web Conference. New York: ACM, 2018: 167-176.
[7] CHEN M M, BEUTEL A, COVINGTON P, et al. Top-K off-policy correction for a REINFORCE recommender system[C]//Proceedings of the 12th ACM International Conference on Web Search and Data Mining. New York: ACM, 2019: 456-464.
[8] WILLIAMS R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine Learning, 1992, 8(3/4): 229-256.
[9] GAO C M, WANG S Q, LI S J, et al. CIRS: bursting filter bubbles by counterfactual interactive recommender system[J]. ACM Transactions on Information Systems, 2023, 42(1): 1-27.
[10] SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms[J]. arXiv:1707.06347, 2017.
[11] DU C, GAO Z F, YUAN S, et al. Exploration in online advertising systems with deep uncertainty-aware learning[C]//Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. New York: ACM, 2021: 2792-2801.
[12] WU K, BIAN W, CHAN Z, et al. Adversarial gradient driven exploration for deep click-through rate prediction[C]//Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. New York: ACM, 2022: 2050-2058.
[13] AUER P, CESA-BIANCHI N, FISCHER P. Finite-time analysis of the multiarmed bandit problem[J]. Machine Learning, 2002, 47(2/3): 235-256.
[14] SUTTON R S, BARTO A G. Reinforcement learning: an introduction[M]. Cambridge: MIT Press, 2018.
[15] MNIH V, BADIA A P, MIRZA M, et al. Asynchronous methods for deep reinforcement learning[C]//Proceedings of the 33rd International Conference on Machine Learning - Volume 48. New York: ACM, 2016: 1928-1937.
[16] PATHAK D, AGRAWAL P, EFROS A A, et al. Curiosity-driven exploration by self-supervised prediction[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Piscataway: IEEE, 2017: 488-489.
[17] KIM H, KIM J, JEONG Y, et al. EMI: exploration with mutual information[C]//Proceedings of the 36th International Conference on Machine Learning, 2019: 3360-3369.
[18] CHEN M M, WANG Y Y, XU C, et al. Values of user exploration in recommender systems[C]//Proceedings of the 15th ACM Conference on Recommender Systems. New York: ACM, 2021: 85-95.
[19] WANG S Q, GAO C M, GAO M, et al. Who are the best adopters? User selection model for free trial item promotion[J]. IEEE Transactions on Big Data, 2023, 9(2): 746-757.
[20] OUDEYER P Y, KAPLAN F. How can we define intrinsic motivation?[C]//Proceedings of the 8th International Conference on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems, 2008.
[21] HUANG J, OOSTERHUIS H, CETINKAYA B, et al. State encoders in reinforcement learning for recommendation: a reproducibility study[C]//Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2022: 2738-2748.
[22] CHEN M M, XU C, GATTO V, et al. Off-policy actor-critic for recommender systems[C]//Proceedings of the 16th ACM Conference on Recommender Systems. New York: ACM, 2022: 338-349.
[23] TANG J X, WANG K. Personalized top-N sequential recommendation via convolutional sequence embedding[C]//Proceedings of the 11th ACM International Conference on Web Search and Data Mining. New York: ACM, 2018: 565-573.
[24] YUAN F J, KARATZOGLOU A, ARAPAKIS I, et al. A simple convolutional generative network for next item recommendation[C]//Proceedings of the 12th ACM International Conference on Web Search and Data Mining. New York: ACM, 2019: 582-590.
[25] KANG W C, MCAULEY J. Self-attentive sequential recommendation[C]//Proceedings of the 2018 IEEE International Conference on Data Mining. Piscataway: IEEE, 2018: 197-206.
[26] XIN X, PIMENTEL T, KARATZOGLOU A, et al. Rethinking reinforcement learning for recommendation: a prompt perspective[C]//Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2022: 1347-1357.
[27] KONDA V, TSITSIKLIS J. Actor-critic algorithms[C]//Advances in Neural Information Processing Systems, 1999.
[28] SWAMINATHAN A, JOACHIMS T. Counterfactual risk minimization: learning from logged bandit feedback[C]//Proceedings of the 32nd International Conference on Machine Learning, 2015: 814-823.
[29] YU T, THOMAS G, YU L, et al. MOPO: model-based offline policy optimization[C]//Advances in Neural Information Processing Systems, 2020: 14129-14142.
[30] SUTTON R S, MCALLESTER D, SINGH S, et al. Policy gradient methods for reinforcement learning with function approximation[C]//Advances in Neural Information Processing Systems, 1999.
[31] SILVEIRA T, ZHANG M, LIN X, et al. How good your recommender system is? A survey on evaluations in recommendation[J]. International Journal of Machine Learning and Cybernetics, 2019, 10(5): 813-831.
[32] JANNER M, FU J, ZHANG M, et al. When to trust your model: model-based policy optimization[C]//Advances in Neural Information Processing Systems, 2019.
[33] GUO H F, TANG R M, YE Y M, et al. DeepFM: a factorization-machine based neural network for CTR prediction[C]//Proceedings of the 26th International Joint Conference on Artificial Intelligence. Melbourne, Australia: AAAI Press, 2017: 1725-1731.
[34] XU S Y, TAN J T, FU Z H, et al. Dynamic causal collaborative filtering[C]//Proceedings of the 31st ACM International Conference on Information & Knowledge Management. New York: ACM, 2022: 2301-2310.