Monte Carlo (MC) methods are the most widely used methods for estimating the performance of a policy. Given a policy of interest, MC methods produce an estimate by repeatedly running this policy to collect samples and averaging the outcomes. Samples collected during this process are called online samples. To obtain an accurate estimate, MC methods consume a massive number of online samples. When online samples are expensive, e.g., in online recommendation and inventory management, we want to reduce the number of online samples while achieving the same estimation accuracy. To this end, we use off-policy MC methods that evaluate the policy of interest by running a different policy, called the behavior policy. We design a tailored behavior policy such that the variance of the off-policy MC estimator is provably smaller than that of the ordinary MC estimator. Importantly, this tailored behavior policy can be efficiently learned from existing offline data, i.e., previously logged data, which are much cheaper than online samples. With reduced variance, our off-policy MC method requires fewer online samples than the ordinary MC method to evaluate the performance of a policy. Moreover, our off-policy MC estimator is always unbiased.
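To make the off-policy idea concrete, below is a minimal sketch, assuming a hypothetical single-state toy environment and hand-picked policies `pi` (target) and `mu` (behavior); the tailored behavior policy learned from offline data, which is the paper's contribution, is not reproduced here. The sketch contrasts the ordinary on-policy MC estimator with the standard importance-sampling off-policy MC estimator, whose per-sample reweighting by pi(a)/mu(a) keeps the estimate unbiased for the target policy's value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-state, two-action environment: each action yields a noisy
# reward. This stands in for a generic episodic task used only for illustration.
def step(action):
    means = [1.0, 2.0]
    return means[action] + rng.normal(0.0, 1.0)

pi = np.array([0.5, 0.5])   # policy of interest (target policy)
mu = np.array([0.2, 0.8])   # hand-picked behavior policy (not the tailored one)

def ordinary_mc(n_episodes):
    """On-policy MC: run pi itself and average the observed returns."""
    returns = []
    for _ in range(n_episodes):
        a = rng.choice(2, p=pi)
        returns.append(step(a))
    return np.mean(returns)

def off_policy_mc(n_episodes):
    """Off-policy MC: run mu and reweight each return by pi(a) / mu(a).

    The importance-sampling ratio corrects for the mismatch between mu and pi,
    so the estimator remains unbiased for pi's expected return.
    """
    weighted_returns = []
    for _ in range(n_episodes):
        a = rng.choice(2, p=mu)
        rho = pi[a] / mu[a]           # importance-sampling ratio
        weighted_returns.append(rho * step(a))
    return np.mean(weighted_returns)

print("on-policy MC estimate :", ordinary_mc(10_000))
print("off-policy MC estimate:", off_policy_mc(10_000))
```

Whether the off-policy estimate has lower variance depends entirely on how well `mu` is chosen; with an arbitrary behavior policy the variance can even increase, which is why the abstract emphasizes learning a tailored behavior policy from offline data.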