We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards, and develop minimax rate-optimal procedures under three settings. First, when the behavior policy is known, we show that the Switch estimator, a method that alternates between the plug-in and importance sampling estimators, is minimax rate-optimal for all sample sizes. Second, when the behavior policy is unknown, we analyze performance in terms of the competitive ratio, thereby revealing a fundamental gap between the settings of known and unknown behavior policies. When the behavior policy is unknown, any estimator must incur mean-squared error larger than that of the oracle estimator equipped with knowledge of the behavior policy by a multiplicative factor proportional to the support size of the target policy. Moreover, we demonstrate that the plug-in approach achieves this worst-case competitive ratio up to a logarithmic factor. Third, we initiate the study of the partial knowledge setting, in which the minimum probability assigned by the behavior policy is assumed known. We show that the plug-in estimator is optimal for relatively large values of the minimum probability, but is sub-optimal when the minimum probability is low. To remedy this gap, we propose a new estimator based on approximation by Chebyshev polynomials that provably achieves the optimal estimation error. Numerical experiments on both simulated and real data corroborate our theoretical findings.
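To fix ideas, the following minimal sketch contrasts the plug-in, importance sampling, and a Switch-style estimator for bandit off-policy evaluation with a known behavior policy. The policies, sample size, and threshold `tau` are illustrative assumptions of ours, not the paper's tuned construction.

```python
import numpy as np

# Illustrative sketch (our construction, not the paper's exact estimator):
# evaluate a target policy pi_t from data logged under a known behavior
# policy pi_b, with rewards bounded in [0, 1].
rng = np.random.default_rng(0)

K = 10                                              # number of arms
pi_b = np.full(K, 1.0 / K)                          # known behavior policy
pi_t = np.array([0.5, 0.3, 0.1, 0.1] + [0.0] * 6)   # target policy
true_means = rng.uniform(0.0, 1.0, K)               # unknown mean rewards

n = 500
arms = rng.choice(K, size=n, p=pi_b)
rewards = rng.binomial(1, true_means[arms]).astype(float)

# Plug-in: estimate each arm's mean reward, then average under pi_t.
counts = np.bincount(arms, minlength=K)
sums = np.bincount(arms, weights=rewards, minlength=K)
mu_hat = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
v_plug = float(pi_t @ mu_hat)

# Importance sampling: reweight logged rewards by the ratio pi_t / pi_b.
ratio = pi_t / pi_b
v_is = float(np.mean(ratio[arms] * rewards))

# Switch-style estimator: importance sampling on arms with a small
# importance ratio, plug-in on the rest; tau = 2.0 is an arbitrary
# illustrative threshold.
tau = 2.0
low = ratio <= tau
v_switch = float(np.mean(np.where(low[arms], ratio[arms], 0.0) * rewards)
                 + pi_t[~low] @ mu_hat[~low])
```

The switch rule caps the variance contributed by large importance ratios (here capped at `tau`) while keeping the low-bias importance sampling term on well-covered arms.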