We develop confidence bounds that hold uniformly over time for off-policy evaluation in the contextual bandit setting. These confidence sequences are based on recent ideas from martingale analysis and are non-asymptotic, non-parametric, and valid at arbitrary stopping times. We provide algorithms for computing these confidence sequences that strike a good balance between computational and statistical efficiency. We empirically demonstrate the tightness of our approach in terms of failure probability and width and apply it to the "gated deployment" problem of safely upgrading a production contextual bandit system.