Deep Bandits Show-Off: Simple and Efficient Exploration with Deep Networks

Designing efficient exploration is central to Reinforcement Learning because of the fundamental exploration-exploitation dilemma. Bayesian exploration strategies like Thompson Sampling resolve this trade-off in a principled way by modeling and updating the distribution over the parameters of the action-value function, which serves as the outcome model of the environment. However, this technique becomes infeasible for complex environments, because it is difficult to represent and update probability distributions over the parameters of correspondingly complex outcome models. Moreover, the approximation techniques introduced to mitigate this issue typically result in poor exploration-exploitation trade-offs: deep neural network models with approximate posterior methods have been shown to underperform in the deep bandit scenario. In this paper we introduce Sample Average Uncertainty (SAU), a simple and efficient uncertainty measure for contextual bandits. While Bayesian approaches like Thompson Sampling estimate outcome uncertainty indirectly, by first quantifying the variability over the parameters of the outcome model, SAU is a frequentist approach that estimates the uncertainty of the outcomes directly from the value predictions. Importantly, we show theoretically that the uncertainty measure estimated by SAU asymptotically matches the uncertainty provided by Thompson Sampling, as well as its regret bounds. Because of its simplicity, SAU can be seamlessly applied to deep contextual bandits as a very scalable drop-in replacement for epsilon-greedy exploration. Finally, we empirically confirm our theory by showing that SAU-based exploration outperforms current state-of-the-art deep Bayesian bandit methods on several real-world datasets at modest computational cost.
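The abstract does not state the SAU formula, so the following is only a minimal sketch of the general idea of estimating uncertainty directly from value predictions. It assumes, purely for illustration, that the uncertainty of an action is the sample average of its squared prediction errors divided by its pull count, and that exploration samples a perturbed value per action. The class `SAUSamplingAgent`, the function `run_bandit`, the running-mean value model, and the default uncertainty for untried actions are all hypothetical choices, not the authors' implementation.

```python
import numpy as np


class SAUSamplingAgent:
    """Illustrative agent; the uncertainty formula below is an assumption."""

    def __init__(self, n_actions, seed=0):
        self.rng = np.random.default_rng(seed)
        self.counts = np.zeros(n_actions)      # number of pulls per action
        self.values = np.zeros(n_actions)      # predicted value per action
        self.sq_err_sum = np.zeros(n_actions)  # summed squared prediction errors

    def uncertainty(self):
        # Assumed form: average squared prediction error of an action,
        # divided by its pull count. Untried actions get an arbitrary 1.0.
        n = np.maximum(self.counts, 1.0)
        tau2 = np.where(self.counts > 0, self.sq_err_sum / n, 1.0)
        return tau2 / n

    def select(self):
        # Perturb each predicted value with Gaussian noise scaled by the
        # estimated uncertainty, then act greedily on the perturbed values.
        noise = self.rng.standard_normal(len(self.values))
        return int(np.argmax(self.values + np.sqrt(self.uncertainty()) * noise))

    def update(self, action, reward):
        # The prediction error is measured before the value is updated.
        error = reward - self.values[action]
        self.counts[action] += 1
        self.sq_err_sum[action] += error ** 2
        self.values[action] += error / self.counts[action]


def run_bandit(means, steps=500, noise_std=0.1, seed=0):
    """Run the agent on a toy Gaussian bandit and return pull counts."""
    env_rng = np.random.default_rng(seed + 1)
    agent = SAUSamplingAgent(len(means), seed=seed)
    for _ in range(steps):
        action = agent.select()
        reward = means[action] + noise_std * env_rng.standard_normal()
        agent.update(action, reward)
    return agent.counts
```

In this sketch the only state kept beyond the value estimates is a count and a running sum of squared errors per action, which is what would make such a scheme as cheap as epsilon-greedy. In a deep contextual bandit, the running mean would be replaced by the network's prediction for the chosen action in the given context.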
