Off-policy evaluation (OPE) aims to estimate the performance of hypothetical policies using data generated by a different policy. Because of its large potential impact in practice, research interest in this field has been growing. However, there is no public real-world dataset that enables the evaluation of OPE, which makes experimental studies of OPE unrealistic and irreproducible. To enable realistic and reproducible OPE research, we present the Open Bandit Dataset, a public logged bandit dataset collected on a large-scale fashion e-commerce platform, ZOZOTOWN. Our dataset is unique in that it contains multiple logged bandit datasets collected by running different policies on the same platform. This enables experimental comparisons of different OPE estimators for the first time. We also develop Python software called Open Bandit Pipeline to streamline and standardize the implementation of batch bandit algorithms and OPE. Our open data and software will contribute to fair and transparent OPE research and help the community identify fruitful research directions. We provide extensive benchmark experiments of existing OPE estimators using our dataset and software. The results reveal essential challenges and open new avenues for future OPE research.
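To make the OPE setting concrete, the following is a minimal sketch of one standard estimator, inverse probability weighting (IPW), applied to simulated logs. It does not use the Open Bandit Dataset or the Open Bandit Pipeline API. The function names (`ipw_estimate`, `simulate_logs`), the context-free three-action setup, and all reward and policy probabilities are assumptions chosen for illustration only.

```python
import random


def ipw_estimate(actions, rewards, pscores, eval_probs):
    """Inverse probability weighting estimate of an evaluation policy's value.

    actions[i] : action chosen by the logging (behavior) policy in round i
    rewards[i] : reward observed for that action
    pscores[i] : probability the logging policy assigned to actions[i]
    eval_probs : eval_probs[a] is the probability the evaluation policy picks a
    """
    weighted = [
        eval_probs[a] / p * r  # reweight each logged reward by pi_e / pi_b
        for a, r, p in zip(actions, rewards, pscores)
    ]
    return sum(weighted) / len(weighted)


def simulate_logs(n_rounds, behavior_probs, reward_means, seed=0):
    """Hypothetical log generator: context-free bandit with Bernoulli rewards."""
    rng = random.Random(seed)
    n_actions = len(behavior_probs)
    actions, rewards, pscores = [], [], []
    for _ in range(n_rounds):
        a = rng.choices(range(n_actions), weights=behavior_probs)[0]
        r = 1.0 if rng.random() < reward_means[a] else 0.0
        actions.append(a)
        rewards.append(r)
        pscores.append(behavior_probs[a])
    return actions, rewards, pscores


# Illustrative (assumed) numbers: three actions, uniform logging policy.
reward_means = [0.1, 0.3, 0.6]
behavior_probs = [1 / 3, 1 / 3, 1 / 3]
eval_probs = [0.1, 0.1, 0.8]  # the hypothetical policy to be evaluated

actions, rewards, pscores = simulate_logs(50_000, behavior_probs, reward_means)
estimate = ipw_estimate(actions, rewards, pscores, eval_probs)

# In simulation the true value is known, so the estimate can be checked.
true_value = sum(p * m for p, m in zip(eval_probs, reward_means))
print(f"IPW estimate: {estimate:.4f}, true value: {true_value:.4f}")
```

In simulation the true policy value is known by construction. With real logged data it is not, which is why logs collected under several different policies on the same platform are needed to compare estimators experimentally.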