We study off-policy evaluation and learning from sequential data in a structured class of Markov decision processes that arise from repeated interactions with an exogenous sequence of arrivals, each with a context that generates an unknown individual-level response to the agent's actions. This model can be thought of as an offline generalization of contextual bandits with resource constraints. We formalize the relevant causal structure of problems such as dynamic personalized pricing and other operations management problems in the presence of potentially high-dimensional user types. The key insight is that an individual-level response is often not causally affected by the state variable and can therefore easily be generalized across timesteps and states. When this is true, we study implications for (doubly robust) off-policy evaluation and learning: single-timestep evaluation, which estimates the expectation over a single arrival from population data, can instead be leveraged for fitted-value iteration in a marginal MDP. We study sample complexity and analyze error amplification that leads to the persistence, rather than attenuation, of confounding error over time. In simulations of dynamic and capacitated pricing, we show improved out-of-sample policy performance on this relevant class of problems.
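As a point of reference for the single-timestep evaluation mentioned above, and purely as an illustrative sketch in notation of our own choosing (contexts $X_i$, logged actions $A_i$, responses $Y_i$, outcome model $\hat{\mu}$, and propensity model $\hat{e}$ are assumptions, not quantities defined in the abstract), the standard doubly robust estimator of a policy $\pi$'s expected single-arrival response from $n$ logged interactions is
\[
\hat{V}_{\mathrm{DR}}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n}\Big[\hat{\mu}\big(X_i,\pi(X_i)\big) \;+\; \frac{\mathbf{1}\{A_i=\pi(X_i)\}}{\hat{e}(A_i\mid X_i)}\,\big(Y_i-\hat{\mu}(X_i,A_i)\big)\Big].
\]
Under the structural assumption that responses are not causally affected by the state, an estimate of this kind can serve as the per-step target when running fitted-value iteration on the marginal MDP; the exact estimator and its analysis are developed in the body of the paper.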