Methods for sequential decision-making are often built on the foundational assumption that the underlying decision process is stationary. This limits their applicability, because real-world problems are often subject to changes due to external factors (passive non-stationarity), changes induced by interactions with the system itself (active non-stationarity), or both (hybrid non-stationarity). In this work, we take first steps toward addressing the fundamental challenge of on-policy and off-policy evaluation amidst structured changes due to active, passive, or hybrid non-stationarity. To this end, we make a higher-order stationarity assumption: non-stationarity results in changes over time, but the way in which those changes happen is fixed. We propose OPEN, an algorithm that uses a double application of counterfactual reasoning and a novel importance-weighted instrument-variable regression to obtain an estimate, with both lower bias and lower variance, of the structure in the changes of a policy's past performances. Finally, we show promising results on how OPEN can be used to predict future performance in several domains that are inspired by real-world applications and exhibit non-stationarity.
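To make the higher-order stationarity idea concrete, the following is a minimal sketch of a naive baseline, not OPEN itself. It assumes that each past episode yields an ordinary importance-sampling estimate of the evaluation policy's performance, and that consecutive performances follow a fixed linear rule, which is fit by plain least squares. The function names (`is_return`, `fit_ar1`, `forecast`) and the first-order linear form are illustrative assumptions, not taken from the paper. Because the regressors here are themselves noisy estimates, a plain least-squares fit of this kind is generally biased; that errors-in-variables issue is one standard motivation for instrument-variable regression.

```python
from typing import List, Sequence, Tuple


def is_return(rewards: Sequence[float],
              pi_probs: Sequence[float],
              beta_probs: Sequence[float]) -> float:
    """Ordinary importance-sampling estimate of one episode's return.

    pi_probs / beta_probs are the probabilities that the evaluation and
    behavior policies assigned to the actions actually taken (hypothetical
    inputs for this sketch).
    """
    weight = 1.0
    for p, b in zip(pi_probs, beta_probs):
        weight *= p / b
    return weight * sum(rewards)


def fit_ar1(series: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of series[i + 1] = a * series[i] + c."""
    if len(series) < 2:
        raise ValueError("need at least two performance estimates")
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        # Degenerate case: no variation in the regressor, predict the mean.
        return 0.0, mean_y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov / var_x
    return a, mean_y - a * mean_x


def forecast(series: Sequence[float], steps: int) -> List[float]:
    """Roll the fitted rule forward to predict future performances."""
    a, c = fit_ar1(series)
    y, out = series[-1], []
    for _ in range(steps):
        y = a * y + c
        out.append(y)
    return out
```

In this sketch, the "way changes happen" is the pair `(a, c)`: the performances themselves drift over time, but the rule that maps one performance to the next stays fixed, which is what allows extrapolation to future episodes.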