Extensive practical work with the Deep Q-Network (DQN) algorithm indicates that the stochastic policy, despite its simplicity, is the most frequently used exploration approach. However, most existing stochastic exploration approaches either explore new actions heuristically, regardless of their Q-values, or inevitably introduce bias into the learning process by coupling the sampling with Q-values. In this paper, we propose a novel preference-guided $\epsilon$-greedy exploration algorithm that efficiently learns an action distribution in line with the landscape of Q-values for DQN without introducing additional bias. Specifically, we design a dual architecture consisting of two branches, one of which is a copy of DQN, namely the Q-branch. The other branch, which we call the preference branch, learns the action preference that the DQN implicitly follows. We theoretically prove that the policy improvement theorem holds for the preference-guided $\epsilon$-greedy policy and experimentally show that the inferred action preference distribution aligns with the landscape of the corresponding Q-values. Consequently, preference-guided $\epsilon$-greedy exploration motivates the DQN agent to take diverse actions: actions with larger Q-values are sampled more frequently, whereas actions with smaller Q-values still have a chance to be explored, thus encouraging exploration. We evaluate the proposed method with four well-known DQN variants in nine different environments. Extensive results confirm the superiority of our method in terms of performance and convergence speed.

Index Terms - Preference-guided exploration, stochastic policy, data efficiency, deep reinforcement learning, deep Q-learning.
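To make the sampling rule concrete, the PyTorch sketch below shows one plausible reading of the dual architecture and the preference-guided $\epsilon$-greedy selection described above. It is a minimal sketch, not the paper's actual implementation: the shared encoder, the layer sizes, and the names DualBranchDQN, q_head, pref_head, and select_action are all illustrative assumptions. The only ingredients taken from the abstract are the two branches (a Q-branch and a preference branch) and the idea of drawing exploratory actions from the learned preference distribution rather than uniformly.

```python
# Minimal sketch of preference-guided epsilon-greedy action selection.
# Assumptions (not specified in the abstract): the two branches share an
# encoder; the preference branch outputs logits that a softmax turns into
# an action-preference distribution; exploration samples from that
# distribution instead of uniformly. All names here are hypothetical.
import torch
import torch.nn as nn


class DualBranchDQN(nn.Module):
    """Q-branch plus preference branch over a shared encoder (hypothetical layout)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.q_head = nn.Linear(hidden, n_actions)     # Q-branch: a copy of the DQN head
        self.pref_head = nn.Linear(hidden, n_actions)  # preference branch: action logits

    def forward(self, obs: torch.Tensor):
        h = self.encoder(obs)
        return self.q_head(h), self.pref_head(h)


def select_action(net: DualBranchDQN, obs: torch.Tensor, epsilon: float) -> int:
    """Act greedily w.p. 1 - epsilon; otherwise sample from the learned
    preference distribution, so actions with larger Q-values are explored
    more often while actions with smaller Q-values keep a nonzero chance."""
    with torch.no_grad():
        q_values, pref_logits = net(obs.unsqueeze(0))
    if torch.rand(()) < epsilon:
        probs = torch.softmax(pref_logits.squeeze(0), dim=-1)
        return int(torch.multinomial(probs, 1).item())
    return int(q_values.argmax(dim=-1).item())
```

Under this reading, exploitation still takes the greedy Q-value action, while the $\epsilon$-fraction of exploratory steps follows the Q-value landscape rather than a uniform draw, which matches the diversity behavior the abstract claims.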