We consider the problem of preference-based reinforcement learning (PbRL), where, unlike traditional reinforcement learning, an agent receives feedback only in the form of a 1-bit (0/1) preference over a pair of trajectories rather than absolute rewards for them. The success of the traditional RL framework crucially relies on the underlying agent-reward model; however, this depends on how accurately the system designer can express an appropriate reward function, which is often a non-trivial task. The main novelty of our framework is the ability to learn from preference-based trajectory feedback, which eliminates the need to hand-craft numeric reward models. This paper sets up a formal framework for the PbRL problem with non-Markovian rewards, where the trajectory preferences are encoded by a generalized linear model of dimension $d$. Assuming the transition model is known, we propose an algorithm with an almost-optimal regret guarantee of $\widetilde{\mathcal{O}}\left( SH d \log (T / \delta) \sqrt{T} \right)$. We further extend this algorithm to the case of unknown transition dynamics and provide an algorithm with a near-optimal regret guarantee of $\widetilde{\mathcal{O}}\left((\sqrt{d} + H^2 + |\mathcal{S}|)\sqrt{dT} + \sqrt{|\mathcal{S}||\mathcal{A}|TH} \right)$. To the best of our knowledge, our work is one of the first to give tight regret guarantees for preference-based RL problems with trajectory preferences.
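For concreteness, a common instantiation of such a generalized linear preference model (a sketch of the setup, not necessarily the exact formulation used in the body of the paper) scores each trajectory $\tau$ through a $d$-dimensional feature map $\phi(\tau)$ and an unknown parameter $\theta^{*} \in \mathbb{R}^{d}$, and generates the 1-bit feedback through a link function $\mu$; taking $\mu$ to be the sigmoid recovers a Bradley-Terry-style comparison model:
\[
\Pr\left(\tau_1 \succ \tau_2\right) \;=\; \mu\!\left(\left\langle \phi(\tau_1) - \phi(\tau_2),\; \theta^{*} \right\rangle\right),
\qquad \mu(x) = \frac{1}{1 + e^{-x}}.
\]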