We study Markov Decision Processes (MDPs) wherein states correspond to causal graphs that stochastically generate rewards. In this setup, the learner's goal is to identify atomic interventions that lead to high rewards by intervening on variables at each state. Generalizing the recent causal-bandit framework, the current work develops (simple) regret minimization guarantees for two-stage causal MDPs, with a parallel causal graph at each state. We propose an algorithm that achieves an instance-dependent regret bound. A key feature of our algorithm is that it utilizes convex optimization to address the exploration problem. We identify classes of instances in which our regret guarantee is essentially tight, and we experimentally validate our theoretical results.
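To give a rough sense of the convex-optimization step mentioned above (this is a hedged illustration, not the paper's actual algorithm), one common way to allocate an exploration budget over candidate atomic interventions is to choose sampling proportions that minimize a worst-case estimation-variance proxy, which is a small convex program. The per-intervention constants c below are hypothetical placeholders for instance-dependent quantities.

```python
import numpy as np
import cvxpy as cp

# Hypothetical per-intervention constants (e.g., variance proxies); illustrative only.
c = np.array([0.9, 0.4, 0.4, 0.1, 0.1, 0.05])
n = len(c)

# Sampling proportions over the n candidate atomic interventions.
w = cp.Variable(n, nonneg=True)

# Minimize the worst-case variance proxy c_i / w_i subject to the proportions summing to one.
objective = cp.Minimize(cp.max(cp.multiply(c, cp.inv_pos(w))))
constraints = [cp.sum(w) == 1]
cp.Problem(objective, constraints).solve()

print("exploration proportions:", np.round(w.value, 3))
```

Interventions with larger constants c_i receive proportionally more exploration samples, which is the qualitative behavior an instance-dependent allocation is meant to capture.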