Standard Markov decision process (MDP) and reinforcement learning algorithms optimize a policy with respect to the expected gain. We propose an algorithm that optimizes an alternative objective: the probability that the gain is greater than a given value. The algorithm can be seen as an extension of value iteration. We also show how the proposed algorithm could be generalized to use neural networks, in the same way that deep Q-learning extends Q-learning.
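To make the alternative objective concrete, the following is a minimal sketch and not the algorithm proposed here. It assumes a finite horizon, undiscounted integer rewards, and a hypothetical toy MDP; the names `TRANSITIONS`, `success_prob`, `action_value`, and `best_action` are illustrative only. The idea it shows is a value-iteration-style recursion over the augmented state (state, remaining threshold, steps left), where the value is the best achievable probability that the collected reward is strictly greater than the threshold.

```python
from functools import lru_cache

# Hypothetical toy MDP (an assumption for illustration):
# TRANSITIONS[state][action] -> list of (probability, next_state, reward)
TRANSITIONS = {
    0: {
        "safe": [(1.0, 0, 1)],                # always pays 1
        "risky": [(0.5, 0, 3), (0.5, 0, 0)],  # pays 3 or 0, expected 1.5
    },
}


@lru_cache(maxsize=None)
def success_prob(state, threshold, steps_left):
    """Max probability that the reward collected over the remaining
    steps is strictly greater than `threshold`."""
    if steps_left == 0:
        # Nothing more can be collected: success iff the target is already met.
        return 1.0 if 0 > threshold else 0.0
    return max(
        action_value(state, threshold, steps_left, action)
        for action in TRANSITIONS[state]
    )


def action_value(state, threshold, steps_left, action):
    """Success probability of taking `action` now and acting optimally after.
    Each received reward lowers the threshold still to be exceeded."""
    return sum(
        prob * success_prob(next_state, threshold - reward, steps_left - 1)
        for prob, next_state, reward in TRANSITIONS[state][action]
    )


def best_action(state, threshold, steps_left):
    """Action maximizing the success probability in the augmented state."""
    return max(
        TRANSITIONS[state],
        key=lambda action: action_value(state, threshold, steps_left, action),
    )


if __name__ == "__main__":
    # Two steps, target "gain > 1": playing safe succeeds with certainty.
    print(success_prob(0, 1, 2), best_action(0, 1, 2))  # 1.0 safe
    # Two steps, target "gain > 2": the risky action becomes preferable.
    print(success_prob(0, 2, 2), best_action(0, 2, 2))  # 0.75 risky
```

In this toy example the action with the higher expected reward ("risky") is not the one that maximizes the probability of exceeding a gain of 1, which is the kind of difference between the two objectives that motivates the alternative criterion.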