This paper addresses optimization problems with continuous time, continuous state and continuous action in stochastic settings, solved with reinforcement learning algorithms, and focuses on the policy evaluation step. We prove that standard learning algorithms based on the discretized temporal difference are doomed to fail as the time step tends to zero, because of the stochastic part. We propose a variance-reduction correction of the temporal difference, which leads to new learning algorithms that remain stable as the time step vanishes. This allows us to give theoretical guarantees that our algorithms converge to the solutions of continuous stochastic optimization problems.
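The instability described above can be illustrated numerically. The sketch below is not the paper's construction or its proposed correction; it is a toy example under stated assumptions: a one-dimensional state following dX = σ dW, a reward r(x) = x², and a discount rate ρ, for which the exact value function is V(x) = x²/ρ + σ²/ρ². All names (`SIGMA`, `RHO`, `scaled_td_stats`) are hypothetical. It evaluates the discretized temporal difference at the exact V, divided by the time step h, and estimates its mean and variance.

```python
import math
import random

SIGMA = 1.0  # diffusion coefficient (assumed for this toy problem)
RHO = 1.0    # discount rate (assumed for this toy problem)


def reward(x):
    """Running reward r(x) = x^2 (an illustrative choice)."""
    return x * x


def value(x):
    """Exact discounted value for dX = SIGMA dW with reward x^2.

    It solves RHO * V = r + (SIGMA^2 / 2) * V''.
    """
    return x * x / RHO + SIGMA ** 2 / RHO ** 2


def scaled_td_stats(h, n_samples=20000, x=1.0, seed=0):
    """Sample mean and variance of the TD error divided by the step h.

    The TD error is r(x) h + exp(-RHO h) V(x') - V(x), where x' is one
    Euler step of the diffusion, and V is the exact value function.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        x_next = x + SIGMA * math.sqrt(h) * rng.gauss(0.0, 1.0)
        delta = reward(x) * h + math.exp(-RHO * h) * value(x_next) - value(x)
        samples.append(delta / h)
    mean = sum(samples) / n_samples
    var = sum((s - mean) ** 2 for s in samples) / (n_samples - 1)
    return mean, var


if __name__ == "__main__":
    for h in (0.1, 0.01, 0.001):
        mean, var = scaled_td_stats(h)
        print(f"h={h:g}  mean={mean:+.4f}  variance={var:.1f}  "
              f"h*variance={h * var:.3f}")
```

In this toy setting the expected scaled temporal difference is close to zero, as it should be at the exact V, but its variance grows like σ²V'(x)²/h (about 4/h at x = 1 with σ = ρ = 1). The noise therefore swamps the signal as h tends to zero, which is the kind of behavior a variance-reduction correction is meant to remove.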