A key challenge for a reinforcement learning (RL) agent is to incorporate external/expert advice in its learning. The desired goals of an algorithm that can shape the learning of an RL agent with external advice include (a) maintaining policy invariance; (b) accelerating the learning of the agent; and (c) learning from arbitrary advice [3]. To address this challenge, this paper formulates the problem of incorporating external advice in RL as a multi-armed bandit called shaping-bandits. The reward of each arm of shaping-bandits corresponds to the return obtained by following the expert or by following a default RL algorithm learning on the true environment reward. We show that directly applying existing bandit and shaping algorithms that do not reason about the non-stationary nature of the underlying returns can lead to poor results. Thus we propose UCB-PIES (UPIES), Racing-PIES (RPIES), and Lazy PIES (LPIES), three different shaping algorithms built on different assumptions that reason about the long-term consequences of following the expert policy or the default RL algorithm. Our experiments in four different settings show that these proposed algorithms achieve the above-mentioned goals, whereas the other algorithms fail to do so.
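To make the shaping-bandit framing concrete, the sketch below shows a minimal two-armed bandit in which arm 0 stands for "follow the expert's shaping advice" and arm 1 for "follow the default RL algorithm on the true environment reward," with arms selected by a standard UCB1 rule. This is an illustrative assumption-laden sketch, not the paper's UCB-PIES/RPIES/LPIES algorithms, which additionally reason about the non-stationary, long-term consequences of each choice; the callable `run_episode` is hypothetical and is assumed to return the episodic return obtained under the chosen policy.

```python
import numpy as np

def ucb_shaping_bandit(run_episode, num_episodes=200, c=2.0):
    """Illustrative two-armed shaping bandit with a plain UCB1 selection rule.

    run_episode(arm) -> float is a hypothetical callable returning the
    episodic return when arm 0 (expert shaping) or arm 1 (default RL on
    the true reward) is followed for one episode.
    """
    counts = np.zeros(2)   # number of times each arm has been pulled
    means = np.zeros(2)    # running mean return of each arm
    for t in range(1, num_episodes + 1):
        if t <= 2:
            arm = t - 1    # pull each arm once to initialize estimates
        else:
            ucb = means + c * np.sqrt(np.log(t) / counts)
            arm = int(np.argmax(ucb))
        g = run_episode(arm)           # return from following the chosen policy
        counts[arm] += 1
        means[arm] += (g - means[arm]) / counts[arm]  # incremental mean update
    return means, counts
```

As the abstract notes, a stationary rule of this kind can perform poorly precisely because the returns of both arms drift as the underlying RL agent keeps learning; accounting for that non-stationarity is what the proposed PIES variants address.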