What is the action sequence $a\,a'\,a''$ that was likely responsible for reaching state $s'''$ (from state $s$) in 3 steps? Addressing such questions is important in causal reasoning and in reinforcement learning. Inverse "MDP" models $p(a\,a'\,a''|s\,s''')$ can be used to answer them. In the traditional "forward" view, the transition "matrix" $p(s'|s\,a)$ and the policy $\pi(a|s)$ uniquely determine "everything": the whole dynamics $p(a\,s'\,a'\,s''\,a''\ldots|s)$, and with it the action-conditional state process $p(s'\,s''\ldots|s\,a\,a'\,a'')$, the multi-step inverse models $p(a\,a'\,a''\ldots|s\,s^i)$, etc. If the latter is our primary concern, a natural question, analogous to the forward case, is to what extent the 1-step inverse model $p(a|s\,s')$ plus the policy $\pi(a|s)$ determine the multi-step inverse models, or even the whole dynamics. In other words, can forward models be inferred from inverse models, or even be side-stepped? This work addresses this question and variations thereof, and also asks whether efficient decision/inference algorithms exist for it.
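For concreteness, here is a minimal sketch (not taken from the paper; the tensor shapes and random tabular models are illustrative assumptions) of the uncontroversial forward direction: a forward model $p(s'|s\,a)$ together with a policy $\pi(a|s)$ determines the 1-step inverse model $p(a|s\,s')$ by Bayes' rule. The question studied here is the converse, i.e. how much of the forward picture the inverse models alone pin down.

```python
# Sketch (assumed tabular MDP, not from the paper): forward model + policy
# determine the 1-step inverse model p(a|s,s') via Bayes' rule.
import numpy as np

rng = np.random.default_rng(0)
S, A = 4, 3  # illustrative numbers of states and actions

# Forward transition "matrix" p(s'|s,a), shape (S, A, S); normalized over s'.
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)

# Policy pi(a|s), shape (S, A); normalized over a.
pi = rng.random((S, A))
pi /= pi.sum(axis=1, keepdims=True)

# Joint p(a,s'|s) = pi(a|s) * p(s'|s,a); condition on s' to get p(a|s,s').
joint = pi[:, :, None] * P                    # shape (S, A, S)
marg = joint.sum(axis=1, keepdims=True)       # p(s'|s), shape (S, 1, S)
inverse1 = joint / marg                       # p(a|s,s'), shape (S, A, S)

# Sanity check: for every (s,s') pair, the inverse model sums to 1 over actions.
assert np.allclose(inverse1.sum(axis=1), 1.0)
```

Note that the analogous multi-step inverse models require marginalizing over the intermediate states, which is exactly the forward information the abstract asks whether the 1-step inverse model and policy can recover.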