This paper presents a new application of information geometry to reinforcement learning, focusing on dynamic treatment regimes. In the standard framework of reinforcement learning, the Q-function is defined as the conditional expectation of the reward given a state and an action in a single-stage setting. We introduce an equivalence relation, called policy equivalence, on the space of all Q-functions. A class of information divergences is defined on the Q-function space for every stage. The main objective is to propose an estimator of the optimal policy function by a method of minimum information divergence based on a dataset of trajectories. In particular, we discuss the $\gamma$-power divergence, which is shown to have the advantageous property that the $\gamma$-power divergence between policy-equivalent Q-functions vanishes. This property is essential for seeking the optimal policy, which we discuss within the framework of a semiparametric model for the Q-function. Specific choices of the power index $\gamma$ yield interesting relationships among the value function and the geometric and harmonic means of the Q-function. A numerical experiment demonstrates the performance of the minimum $\gamma$-power divergence method in the context of dynamic treatment regimes.
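For orientation, the standard single-stage objects referenced above can be written as follows; the notation is illustrative, and the paper's own definitions may differ in detail:
\begin{align*}
  Q(s,a) &= \mathbb{E}\,[\,R \mid S = s,\ A = a\,], && \text{(Q-function)}\\
  V(s) &= \max_{a}\, Q(s,a), && \text{(value function)}\\
  \pi^{*}(s) &= \operatorname*{arg\,max}_{a}\, Q(s,a). && \text{(greedy, i.e.\ optimal, policy)}
\end{align*}
Under one natural reading, two Q-functions $Q$ and $Q'$ are policy-equivalent when they induce the same greedy policy, that is, $\operatorname*{arg\,max}_{a} Q(s,a) = \operatorname*{arg\,max}_{a} Q'(s,a)$ for every state $s$; a divergence that vanishes on such pairs is therefore insensitive to features of the Q-function that do not affect the optimal policy.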