We study reinforcement learning with linear function approximation and adversarially changing cost functions, a setup that has mostly been considered under simplifying assumptions such as full-information feedback or exploratory conditions. We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback, featuring a combination of mirror descent and least-squares policy evaluation in an auxiliary MDP used to compute exploration bonuses. Our algorithm obtains an $\widetilde O(K^{6/7})$ regret bound, improving significantly over the previous state-of-the-art of $\widetilde O (K^{14/15})$ in this setting. In addition, we present a version of the same algorithm under the assumption that a simulator of the environment is available to the learner (but otherwise no exploratory assumptions are made), and prove that it obtains state-of-the-art regret of $\widetilde O (K^{2/3})$.
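For intuition, a generic mirror-descent (multiplicative-weights) policy update of this flavor takes the form below; this is a hedged sketch in which $\eta$, $\widehat Q_k$, and $B_k$ are illustrative notation rather than the paper's exact quantities:
\[
\pi_{k+1}(a \mid s) \;\propto\; \pi_k(a \mid s)\,\exp\!\Big(-\eta\,\big(\widehat Q_k(s,a) - B_k(s,a)\big)\Big),
\]
where $\widehat Q_k$ is a least-squares estimate of the cost-to-go of the current policy $\pi_k$ and $B_k$ is an exploration bonus, here computed via policy evaluation in the auxiliary MDP.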