We consider experiments in dynamical systems where interventions on some experimental units impact other units through a limiting constraint (such as a limited inventory). Despite its outsize practical importance, the best estimators for this `Markovian' interference problem are largely heuristic in nature, and their bias is not well understood. We formalize the problem of inference in such experiments as one of policy evaluation. Off-policy estimators, while unbiased, apparently incur a large penalty in variance relative to state-of-the-art heuristics. We introduce an on-policy estimator: the Differences-In-Q's (DQ) estimator. We show that the DQ estimator can in general have exponentially smaller variance than off-policy evaluation, while its bias is second order in the impact of the intervention. This yields a striking bias-variance tradeoff, so that the DQ estimator effectively dominates state-of-the-art alternatives. From a theoretical perspective, we introduce three separate novel techniques that are of independent interest in the theory of Reinforcement Learning (RL). Our empirical evaluation includes a set of experiments on a city-scale ride-hailing simulator.
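For concreteness, one plausible form of the DQ estimator (the notation below is an illustrative sketch under the assumption of a standard on-policy Q-estimate, not the paper's formal definition) contrasts behavior-policy Q-value estimates, rather than raw rewards, between treated and control observations:
\[
\widehat{\mathrm{ATE}}_{\mathrm{DQ}}
\;=\;
\frac{1}{|\mathcal{T}_1|}\sum_{t \in \mathcal{T}_1} \hat{Q}(s_t, a_t)
\;-\;
\frac{1}{|\mathcal{T}_0|}\sum_{t \in \mathcal{T}_0} \hat{Q}(s_t, a_t),
\]
where $\mathcal{T}_1$ and $\mathcal{T}_0$ index the treated and control observations collected under the experimentation policy, and $\hat{Q}$ is an on-policy (e.g., temporal-difference) estimate of the Q-function. Replacing $\hat{Q}(s_t, a_t)$ with the instantaneous reward $r_t$ would recover a naive difference-in-means estimator, which ignores the downstream interference carried through the shared state.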