Research on Quasi-Hyperbolic Momentum Gradient for Adversarial Deep Reinforcement Learning

doi:10.3778/j.issn.1002-8331.2012-0082

Abstract

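Abstract (translated from the Chinese):

In deep reinforcement learning (DRL), an agent observes the environment state through an observation channel. These observations may contain perturbations introduced by adversarial attacks, i.e., adversarial examples, which cause the agent to select wrong actions. Adversarial examples are commonly generated with stochastic gradient descent (SGD). This paper proposes using the quasi-hyperbolic momentum (QHM) gradient algorithm to generate adversarial perturbations. QHM makes full use of the momentum of previous gradients to correct the descent direction, and is therefore more efficient than SGD at generating adversarial examples. The attack is then used to train DRL for robustness within a robust control framework. Experimental results show that, after adversarial training, DRL trained with the QHM-based method is significantly more robust to attacks and to changes in environment parameters.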

Keywords: deep reinforcement learning, adversarial attack, quasi-hyperbolic momentum gradient, loss function

Abstract:

In deep reinforcement learning (DRL), the agent observes the state of the environment through an observation channel. The observation may be perturbed by adversarial attacks, so that the observed state deviates from the true environment state. An engineered loss function combined with the quasi-hyperbolic momentum (QHM) gradient algorithm is used to strengthen the attack, which degrades the performance of the original DRL algorithm (for example, the double deep Q-network, DDQN). This attack is then used to improve the robustness of DRL within a robust control framework. After adversarial training, the QHM-based DRL is significantly more robust to changes in the original environment parameters. In addition, several adversarial attacks are compared; relative to the other attacks, the QHM-based approach shows significantly improved attack and defense capabilities.

Key words: deep reinforcement learning, adversarial attack, quasi-hyperbolic momentum gradient, loss function
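
The QHM-based attack described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear score model `W`, the cross-entropy-style attack loss, the L-infinity bound `eps`, and all hyperparameter values are assumptions chosen only to keep the example self-contained. The update itself follows the standard QHM rule, in which a momentum buffer `g` is an exponential moving average of past gradients and the step direction is a `nu`-weighted mix of the current gradient and that buffer (`nu = 0` reduces to a plain gradient step).

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def loss_and_grad(W, s, a_star):
    """Toy attack loss (an assumption): -log softmax(W s)[a_star].

    Returns the loss and its gradient with respect to the observation s.
    """
    p = softmax(W @ s)
    onehot = np.zeros_like(p)
    onehot[a_star] = 1.0
    return -np.log(p[a_star]), W.T @ (p - onehot)

def qhm_attack(W, s, eps=0.5, alpha=1.0, beta=0.9, nu=0.7, steps=10):
    """Perturb observation s by QHM gradient ascent on the toy loss.

    The perturbation is kept inside an L-infinity ball of radius eps.
    Returns the perturbed observation and the clean greedy action.
    """
    a_star = int(np.argmax(W @ s))        # action chosen on the clean input
    s_adv = s.copy()
    g = np.zeros_like(s)                  # momentum buffer, starts at zero
    for _ in range(steps):
        _, grad = loss_and_grad(W, s_adv, a_star)
        g = beta * g + (1.0 - beta) * grad        # EMA of past gradients
        direction = (1.0 - nu) * grad + nu * g    # quasi-hyperbolic mix
        s_adv = s_adv + alpha * direction         # ascend the attack loss
        s_adv = s + np.clip(s_adv - s, -eps, eps) # project into the eps-ball
    return s_adv, a_star

if __name__ == "__main__":
    W = np.eye(2)                         # hypothetical 2-action linear agent
    s = np.array([1.0, 0.5])
    s_adv, a_star = qhm_attack(W, s)
    print("clean action:", a_star)
    print("action under attack:", int(np.argmax(W @ s_adv)))
```

In an adversarial-training setting such as the one the abstract outlines, perturbed observations like `s_adv` would be fed to the agent during training in place of the clean ones; how the paper integrates this with DDQN is not specified here.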

马志豪，朱响斌. 拟双曲动量梯度的对抗深度强化学习研究[J]. 计算机工程与应用, 2021, 57(24): 90-99.

MA Zhihao, ZHU Xiangbin. Research on Quasi-hyperbolic Momentum Gradient for Adversarial Deep Reinforcement Learning[J]. Computer Engineering and Applications, 2021, 57(24): 90-99.
