This paper presents an inverse reinforcement learning~(IRL) framework for Bayesian stopping time problems. By observing the actions of a Bayesian decision maker, we provide a necessary and sufficient condition to identify whether these actions are consistent with optimizing a cost function. In a Bayesian (partially observed) setting, the inverse learner can at best identify optimality with respect to the observed strategies. Our IRL algorithm identifies optimality and then constructs set-valued estimates of the cost function. To achieve this IRL objective, we use novel ideas from Bayesian revealed preferences stemming from microeconomics. We illustrate the proposed IRL scheme using two important examples of stopping time problems, namely, sequential hypothesis testing and Bayesian search. As a real-world example, using a YouTube dataset comprising metadata from 190,000 videos, we show that the proposed IRL method predicts user engagement on online multimedia platforms with high accuracy. Finally, for finite datasets, we propose an IRL detection algorithm and give finite-sample bounds on its error probabilities.