We introduce the "inverse bandit" problem of estimating the rewards of a multi-armed bandit instance from observing the learning process of a low-regret demonstrator. Existing approaches to the related problem of inverse reinforcement learning assume the execution of an optimal policy, and thereby suffer from an identifiability issue. In contrast, our paradigm leverages the demonstrator's behavior en route to optimality, and in particular, the exploration phase, to obtain consistent reward estimates. We develop simple and efficient reward estimation procedures for demonstrations within a class of upper-confidence-based algorithms, showing that reward estimation gets progressively easier as the regret of the algorithm increases. We match these upper bounds with information-theoretic lower bounds that apply to any demonstrator algorithm, thereby characterizing the optimal tradeoff between exploration and reward estimation. Extensive empirical evaluations on both synthetic data and simulated experimental design data from the natural sciences corroborate our theoretical results.
翻译:我们引入了“反反强盗”问题,即通过观察低风险示威者的学习过程来估计多武装强盗事件奖赏的“反反反强盗”问题。 现有的反反强化学习相关问题的方法假定执行最佳政策,从而受可识别性问题的影响。 相反,我们的范式利用示威者在优化、特别是探索阶段中的行为,以获得一致的奖赏估计。 我们为一类基于高度信任的算法中的示威制定了简单有效的奖赏估计程序,表明奖励估计随着算法的遗憾增加而逐渐变得更加容易。 我们将这些上限与适用于任何示威者算法的信息理论下限相匹配,从而将勘探和奖赏估计之间的最佳取舍定性。 对合成数据和来自自然科学的模拟实验设计数据进行广泛的实证评估证实了我们的理论结果。