Thompson sampling is a well-known approach for balancing exploration and exploitation in reinforcement learning. It requires the posterior distribution of action-value functions to be maintained, which is generally intractable for tasks with a high-dimensional state-action space. We derive a variational Thompson sampling approximation for DQNs that uses a deep network whose parameters are perturbed by a learned variational noise distribution. We interpret the successful NoisyNets method \cite{fortunato2018noisy} as an approximation to the variational Thompson sampling method that we derive. Further, we propose State Aware Noisy Exploration (SANE), which seeks to improve on NoisyNets by allowing a non-uniform perturbation in which the amount of parameter noise is conditioned on the state of the agent. This is achieved with the help of an auxiliary perturbation module, whose output is state-dependent and is learned end-to-end with gradient descent. We hypothesize that such state-aware noisy exploration is particularly useful in problems where exploration in certain \textit{high-risk} states may cause the agent to fail badly. We demonstrate the effectiveness of the state-aware exploration method in the off-policy setting by augmenting DQNs with the auxiliary perturbation module.
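As a minimal sketch of the distinction, assuming a NoisyNets-style perturbation of each weight with a learned mean $\mu$ and scale $\sigma$, NoisyNets samples parameters as
\[
\theta = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),
\]
where $\mu$ and $\sigma$ are independent of the input, whereas a state-aware scheme conditions the amount of perturbation on the current state $s$ through an auxiliary module $f_{\phi}$, e.g.
\[
\theta(s) = \mu + f_{\phi}(s)\, \sigma \odot \epsilon,
\]
with $f_{\phi}(s)$ a hypothetical state-dependent scale used here only for illustration; the exact parameterization of SANE is specified in the main text.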