Monte Carlo (MC) methods are the most widely used methods for estimating the performance of a policy. Given a policy of interest, MC methods produce estimates by repeatedly running this policy to collect samples and averaging the outcomes. Samples collected during this process are called online samples. To obtain an accurate estimate, MC methods consume a massive number of online samples. When online samples are expensive, e.g., in online recommendation and inventory management, we want to reduce the number of online samples while achieving the same estimation accuracy. To this end, we use off-policy MC methods that evaluate the policy of interest by running a different policy, called the behavior policy. We design a tailored behavior policy such that the variance of the off-policy MC estimator is provably smaller than that of the ordinary MC estimator. Importantly, this tailored behavior policy can be efficiently learned from existing offline data, i.e., previously logged data, which are much cheaper than online samples. With reduced variance, our off-policy MC method requires fewer online samples than the ordinary MC method to evaluate the performance of a policy. Moreover, our off-policy MC estimator is always unbiased.
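To make the on-policy versus off-policy distinction concrete, here is a minimal sketch (not the paper's algorithm or its tailored behavior policy) of ordinary MC estimation versus an importance-sampled off-policy MC estimate on a toy one-step bandit. The target policy `pi`, behavior policy `mu`, and reward model below are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: 3 actions with noisy rewards (assumed for illustration).
reward_mean = np.array([1.0, 0.0, 2.0])

def sample_reward(a):
    return reward_mean[a] + rng.normal(scale=0.5)

pi = np.array([0.2, 0.1, 0.7])   # target policy we want to evaluate
mu = np.array([0.3, 0.1, 0.6])   # behavior policy used to collect samples

true_value = float(pi @ reward_mean)
n = 10_000

# Ordinary (on-policy) MC: run pi itself and average the outcomes.
a_on = rng.choice(3, size=n, p=pi)
on_policy_est = np.mean([sample_reward(a) for a in a_on])

# Off-policy MC: run mu, reweight each outcome by the importance ratio pi/mu.
# This keeps the estimator unbiased regardless of the choice of mu (as long
# as mu covers every action that pi can take).
a_off = rng.choice(3, size=n, p=mu)
ratios = pi[a_off] / mu[a_off]
off_policy_est = np.mean([w * sample_reward(a) for w, a in zip(ratios, a_off)])

print(f"true value        : {true_value:.3f}")
print(f"on-policy MC      : {on_policy_est:.3f}")
print(f"off-policy MC (IS): {off_policy_est:.3f}")
```

In this sketch `mu` is fixed by hand; the point of the paper's approach is instead to learn a tailored behavior policy from offline logged data so that the importance-weighted estimator has provably lower variance than the on-policy average.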