We introduce the "inverse bandit" problem of estimating the rewards of a multi-armed bandit instance from observing the learning process of a low-regret demonstrator. Existing approaches to the related problem of inverse reinforcement learning assume the execution of an optimal policy and therefore suffer from an identifiability issue. In contrast, we propose to leverage the demonstrator's behavior en route to optimality, and in particular its exploration phase, for reward estimation. We begin by establishing a general information-theoretic lower bound under this paradigm that applies to any demonstrator algorithm and characterizes a fundamental tradeoff between the accuracy of reward estimation and the amount of exploration performed by the demonstrator. We then develop simple and efficient reward estimators for upper-confidence-based demonstrator algorithms that attain this optimal tradeoff, showing in particular that consistent reward estimation -- free of identifiability issues -- is possible under our paradigm. Extensive simulations on both synthetic and semi-synthetic data corroborate our theoretical results.
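To make the paradigm concrete, the following minimal sketch (not one of the estimators analyzed in the paper) simulates a UCB1 demonstrator on a three-armed Gaussian bandit and has the observer recover coarse estimates of the suboptimality gaps from the demonstrator's arm-pull counts alone, using the standard fact that UCB1 pulls a gap-Delta arm on the order of 2*log(T)/Delta^2 times. The arm means mu, the horizon T, and the naive count-inversion estimator are illustrative assumptions, not choices made in the paper.

# Illustrative sketch only: a UCB1 demonstrator explores a Gaussian bandit,
# and an observer who sees ONLY the arm-pull counts (no rewards) inverts the
# approximate relation N_a(T) ~ 2*log(T)/Delta_a^2 to estimate the gaps.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.9, 0.6, 0.3])      # true (unknown) mean rewards; unit-variance Gaussian noise
K, T = len(mu), 100_000

counts = np.zeros(K, dtype=int)     # N_a(t): observed by the reward estimator
means = np.zeros(K)                 # demonstrator's empirical means: NOT observed

# --- demonstrator: standard UCB1 ---
for t in range(T):
    if t < K:
        a = t                        # pull each arm once to initialize
    else:
        ucb = means + np.sqrt(2.0 * np.log(t) / counts)
        a = int(np.argmax(ucb))
    r = mu[a] + rng.standard_normal()
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]

# --- observer: naive gap estimate from pull counts alone ---
gap_hat = np.sqrt(2.0 * np.log(T) / counts)
best = int(np.argmax(counts))        # the most-pulled arm is, with high probability, the best arm
gap_hat[best] = 0.0

print("true gaps     :", np.round(mu.max() - mu, 3))
print("estimated gaps:", np.round(gap_hat, 3))

The recovered gaps are rough (UCB1's constants and random fluctuations bias the counts), but they illustrate how the exploration pattern of a low-regret demonstrator carries reward information that an optimal policy alone would not reveal.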