In this short note we derive a relationship between the Bregman divergence from the current policy to the optimal policy and the suboptimality of the current value function in a regularized Markov decision process. This result has implications for multi-task reinforcement learning, offline reinforcement learning, and regret analysis under function approximation, among other areas.
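For reference, the Bregman divergence between policies $\pi$ and $\pi^\ast$ generated by a strictly convex, differentiable function $\Omega$ (here assumed to be the regularizer of the MDP) is the standard

```latex
D_\Omega(\pi, \pi^\ast)
  \;=\; \Omega(\pi) \;-\; \Omega(\pi^\ast)
  \;-\; \big\langle \nabla \Omega(\pi^\ast),\, \pi - \pi^\ast \big\rangle .
```

In particular, when $\Omega$ is the negative Shannon entropy, $D_\Omega$ reduces to the KL divergence $\mathrm{KL}(\pi \,\|\, \pi^\ast)$, the case most commonly used in entropy-regularized reinforcement learning.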