We derive a concentration bound for a Q-learning algorithm for average-cost Markov decision processes that is based on an equivalent shortest path problem, and compare it numerically with the alternative scheme based on relative value iteration.