When a reinforcement learning (RL) method has to decide between several candidate policies solely by looking at the received reward, it implicitly has to optimize a Multi-Armed Bandit (MAB) problem. This raises the question: are current RL algorithms capable of solving MAB problems? We claim that the surprising answer is no. Our experiments show that in some situations they fail to solve a basic MAB problem, and in many common situations they struggle: they suffer from regression of results during training, sensitivity to initialization, and high sample complexity. We claim that this stems from variance differences between policies, which cause two problems. The first is the "Boring Policy Trap": each policy has a different level of implicit exploration that depends on its reward variance, so leaving a boring, low-variance policy is unlikely because of its low implicit exploration. The second is the "Manipulative Consultant" problem: the value-estimation functions used in deep RL algorithms such as DQN or deep Actor-Critic methods maximize estimation precision rather than mean reward, and achieve a better loss on low-variance policies, which causes the network to converge to a sub-optimal policy. Cognitive experiments on humans have shown that noisy reward signals may paradoxically improve performance. We explain this using the aforementioned problems, arguing that humans and algorithms may share similar challenges in decision making. Inspired by this result, we propose the Adaptive Symmetric Reward Noising (ASRN) method, which equalizes the reward variance across different policies, thus avoiding both problems without changing the environment's mean rewards. We demonstrate that the ASRN scheme can dramatically improve the results.
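For intuition, the following is a minimal sketch of the variance-equalization idea behind ASRN in a toy bandit setting: zero-mean noise is added to each observed reward so that every arm's reward variance approaches the largest per-arm variance, leaving the mean rewards unchanged. The class name, the Welford-style variance tracking, and the target-variance rule are illustrative assumptions for this sketch, not the paper's exact algorithm.

```python
import numpy as np

class ASRNSketch:
    """Illustrative variance-equalizing reward noising for a K-armed bandit."""

    def __init__(self, n_arms):
        self.counts = np.zeros(n_arms)
        self.means = np.zeros(n_arms)
        self.m2 = np.zeros(n_arms)  # running sum of squared deviations (Welford)

    def _update(self, arm, reward):
        self.counts[arm] += 1
        delta = reward - self.means[arm]
        self.means[arm] += delta / self.counts[arm]
        self.m2[arm] += delta * (reward - self.means[arm])

    def _var(self, arm):
        n = self.counts[arm]
        return self.m2[arm] / (n - 1) if n > 1 else 0.0

    def noise(self, arm, reward, rng=np.random):
        """Return the reward plus zero-mean Gaussian noise sized so the
        noised variance of this arm approaches the largest observed
        per-arm variance; the added noise has zero mean, so the
        expected reward of each arm is unchanged."""
        self._update(arm, reward)
        target_var = max(self._var(a) for a in range(len(self.counts)))
        gap = max(target_var - self._var(arm), 0.0)
        return reward + rng.normal(0.0, np.sqrt(gap))
```

In this sketch the noised reward would be fed to the learner in place of the raw reward, so a low-variance ("boring") arm no longer yields a lower-variance signal than the other arms.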