Computer Engineering and Applications, 2021, Vol. 57, Issue (12): 237-242. DOI: 10.3778/j.issn.1002-8331.2003-0382


Application of Multi-armed Bandit Algorithm with Time-Varying Rewards in Dynamic Pricing

QIAO Xunshuang, BI Wenjie   

  1. Business School, Central South University, Changsha 410000, China
  • Online: 2021-06-15 Published: 2021-06-10





Dynamic pricing is a nonstationary Multi-Armed Bandit (MAB) problem: the manufacturer's profit varies over time. Building on previous research, this paper studies the application of an Upper Confidence Bound (UCB) algorithm with time-varying rewards to dynamic pricing under demand uncertainty. The pricing problem is formulated as a multi-armed bandit problem, and a profit-maximization model is constructed to obtain the optimal solution. In simulations comparing the UCB algorithm with time-varying rewards against basic multi-armed bandit algorithms, the proposed algorithm converges faster and learns rewards that are closer to the true rewards. Unlike previous studies, the model accounts for time-varying factors, which brings it closer to dynamic pricing in real scenarios and provides decision support for manufacturers' pricing.

Key words: multi-armed bandit algorithm, dynamic pricing, upper confidence bound
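
The abstract does not give the exact form of the UCB algorithm with time-varying rewards, so the following is only an illustrative sketch of the general idea, not the authors' method. It swaps in a discounted UCB rule, a common way to handle nonstationary bandits, in which older observations are down-weighted by a factor `gamma`. The price grid, unit cost, demand curve, `gamma`, and exploration weight `c` are all hypothetical choices made for the example.

```python
import math
import random


class DiscountedUCB:
    """UCB with exponentially discounted statistics.

    A stand-in for a nonstationary UCB rule; not the paper's algorithm.
    """

    def __init__(self, n_arms, gamma=0.95, c=1.0):
        self.gamma = gamma              # discount factor in (0, 1] (assumed value)
        self.c = c                      # exploration weight (assumed value)
        self.counts = [0.0] * n_arms    # discounted pull counts per arm
        self.sums = [0.0] * n_arms      # discounted reward sums per arm

    def select(self):
        # Pull any arm that has no (remaining) weight before using the bounds.
        for arm, n in enumerate(self.counts):
            if n == 0.0:
                return arm
        total = sum(self.counts)

        def index(arm):
            mean = self.sums[arm] / self.counts[arm]
            bonus = self.c * math.sqrt(math.log(total) / self.counts[arm])
            return mean + bonus

        return max(range(len(self.counts)), key=index)

    def update(self, arm, reward):
        # Decay all statistics so old observations matter less, then add the new one.
        self.counts = [self.gamma * n for n in self.counts]
        self.sums = [self.gamma * s for s in self.sums]
        self.counts[arm] += 1.0
        self.sums[arm] += reward


def simulate(horizon=2000, seed=0):
    """Run the agent on a toy pricing problem whose demand shifts over time."""
    rng = random.Random(seed)
    prices = [4.0, 6.0, 8.0, 10.0]      # hypothetical price grid (the arms)
    cost = 2.0                          # hypothetical unit cost
    max_margin = max(prices) - cost     # used to scale rewards into [0, 1]
    agent = DiscountedUCB(len(prices))
    pulls = [0] * len(prices)
    total_profit = 0.0
    for t in range(horizon):
        arm = agent.select()
        price = prices[arm]
        # Hypothetical demand: willingness to pay rises halfway through the run.
        ceiling = 9.0 if t < horizon // 2 else 14.0
        buy_prob = max(0.0, 1.0 - price / ceiling)
        profit = (price - cost) if rng.random() < buy_prob else 0.0
        agent.update(arm, profit / max_margin)
        pulls[arm] += 1
        total_profit += profit
    return pulls, total_profit
```

Calling `simulate()` returns the number of times each price was posted and the cumulative profit. Setting `gamma=1.0` removes the discounting, which gives a stationary baseline to compare against in the spirit of the comparison the abstract describes.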