We study online learning in adversarial bandit problems under a partial observability model called off-policy feedback. In this sequential decision-making problem, the learner cannot directly observe its rewards, but instead sees those obtained by another unknown policy run in parallel (the behavior policy). Instead of the standard exploration-exploitation dilemma, the learner faces a different challenge in this setting: due to limited observations outside of its control, the learner may not be able to estimate the value of each policy equally well. To address this issue, we propose a set of algorithms that guarantee regret bounds scaling with a natural notion of mismatch between any comparator policy and the behavior policy, achieving improved performance against comparators that are well covered by the observations. We also provide an extension to the setting of adversarial linear contextual bandits, and verify the theoretical guarantees via a set of experiments. Our key algorithmic idea is to adapt the notion of pessimistic reward estimators that has recently become popular in off-policy reinforcement learning.
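To give a rough sense of the pessimism principle referenced above (an illustrative sketch, not the paper's exact construction), one can bias an importance-weighted reward estimator downward by inflating its denominator with a parameter $\gamma > 0$; here $\pi^B_t$ denotes the behavior policy, $A_t \sim \pi^B_t$ the action it plays in round $t$, and $r_t$ the (nonnegative) reward function, all introduced only for this example:

% Illustrative pessimistic estimator; symbols above are assumptions for this sketch.
\[
  \widehat{r}_t(a) \;=\; \frac{\mathbb{I}\{A_t = a\}}{\pi^B_t(a) + \gamma}\, r_t(a),
  \qquad
  \mathbb{E}_t\bigl[\widehat{r}_t(a)\bigr]
  \;=\; \frac{\pi^B_t(a)}{\pi^B_t(a) + \gamma}\, r_t(a)
  \;\le\; r_t(a).
\]

In this sketch, actions that the behavior policy covers well (large $\pi^B_t(a)$) are estimated with little bias, while poorly covered actions are systematically underestimated, which is one way a regret guarantee can come to depend on the mismatch between a comparator policy and the behavior policy.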