We propose a reinforcement learning (RL) approach to compute quasi-stationary distributions. Based on the fixed-point formulation of the quasi-stationary distribution, we minimize the Kullback-Leibler divergence between the two Markovian path distributions induced by a candidate distribution and the true target distribution. To solve this challenging minimization problem by gradient descent, we apply reinforcement learning techniques, introducing reward and value functions. We derive the corresponding policy gradient theorem and design an actor-critic algorithm to learn the optimal solution and the value function. Numerical examples of finite-state Markov chains are presented to demonstrate the new method.
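As background for the fixed-point formulation mentioned above, the sketch below (not the paper's RL algorithm) illustrates the relation the method targets on a small, hypothetical finite-state absorbing chain: the quasi-stationary distribution is the fixed point of the map Phi(mu) = mu Q / (mu Q 1), where Q is the sub-stochastic transition matrix restricted to the non-absorbing states; iterating Phi recovers it directly in this toy setting.

```python
import numpy as np

# Minimal illustration (a baseline, not the actor-critic method of the paper):
# compute the quasi-stationary distribution (QSD) of a small absorbing Markov
# chain by iterating the fixed-point map
#     Phi(mu) = mu Q / (mu Q 1),
# where Q is the transition matrix restricted to the transient states.  The
# QSD is the fixed point of Phi, i.e. the normalized left Perron eigenvector
# of Q.  The chain below is hypothetical.

# Transient states {0, 1, 2}; each row leaks some probability to an absorbing
# state, so the rows of Q sum to less than one.
Q = np.array([
    [0.50, 0.30, 0.10],   # state 0 -> absorbed with prob 0.10
    [0.20, 0.50, 0.25],   # state 1 -> absorbed with prob 0.05
    [0.10, 0.30, 0.50],   # state 2 -> absorbed with prob 0.10
])

mu = np.full(3, 1.0 / 3.0)           # initial candidate distribution
for _ in range(1000):
    nu = mu @ Q                       # one transition step on the transient states
    nu /= nu.sum()                    # renormalize, i.e. condition on survival
    if np.abs(nu - mu).max() < 1e-12:
        break
    mu = nu

print("approximate QSD:", mu)

# Cross-check against the left Perron eigenvector of Q.
vals, vecs = np.linalg.eig(Q.T)
v = np.real(vecs[:, np.argmax(np.real(vals))])
print("eigenvector QSD:", v / v.sum())
```

The RL approach in the paper replaces this direct iteration with a KL-divergence minimization over path distributions, solved by gradient descent with reward and value functions; the fixed-point relation above is what characterizes the target of that optimization.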