We extend the provably convergent Full Gradient DQN algorithm of Avrachenkov et al. (2021) from discounted reward Markov decision processes to average reward problems. In the neural function approximation setting, we experimentally compare the widely used RVI Q-Learning and the recently proposed Differential Q-Learning, each combined with Full Gradient DQN and with DQN. We also extend the approach to learn Whittle indices for Markovian restless multi-armed bandits. Across different tasks, we observe that the proposed Full Gradient variant converges faster.
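To make the distinction concrete, the sketch below contrasts the usual semi-gradient DQN loss with a full-gradient variant of an average-reward (RVI-style) Bellman error. This is a minimal illustration under stated assumptions: the network `q_net`, the choice of reference term `f(Q)`, and all sizes and names are hypothetical and are not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: q_net, the reference term f(Q), and all sizes are
# illustrative assumptions, not the paper's exact architecture.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

def average_reward_bellman_loss(s, a, r, s_next, full_gradient: bool):
    """Empirical average-reward (RVI-style) Bellman error over a batch.

    full_gradient=False reproduces the usual semi-gradient DQN loss
    (target detached); full_gradient=True lets gradients also flow
    through the target, as in a Full Gradient variant.
    """
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a)
    q_next = q_net(s_next).max(dim=1).values               # max_a' Q(s', a')
    f_q = q_net(s).max(dim=1).values.mean()                # RVI offset f(Q); this choice is illustrative
    target = r - f_q + q_next
    if not full_gradient:
        target = target.detach()                           # standard DQN: no gradient through the target
    return ((target - q_sa) ** 2).mean()
```

With `full_gradient=True`, backpropagation passes through both `max_a' Q(s', a')` and `f(Q)` rather than treating the target as a constant, which is the defining difference between the full-gradient and semi-gradient updates.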