We study human-in-the-loop reinforcement learning (RL) with trajectory preferences, where instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer. The goal of the agent is to learn the optimal policy, i.e., the one most preferred by the human overseer. Despite its empirical success, the theoretical understanding of preference-based RL (PbRL) has so far been limited to the tabular case. In this paper, we propose the first optimistic model-based algorithm for PbRL with general function approximation, which estimates the model using value-targeted regression and computes exploratory policies by solving an optimistic planning problem. Our algorithm achieves a regret of $\tilde{O}(\operatorname{poly}(dH)\sqrt{K})$, where $d$ is a complexity measure of the transition and preference models that depends on the Eluder dimension and log-covering numbers, $H$ is the planning horizon, $K$ is the number of episodes, and $\tilde O(\cdot)$ omits logarithmic terms. Our lower bound indicates that our algorithm is near-optimal when specialized to the linear setting. Furthermore, we extend the PbRL problem by formulating a novel problem called RL with $n$-wise comparisons, and provide the first sample-efficient algorithm for this new setting. To the best of our knowledge, this is the first theoretical result for PbRL with (general) function approximation.
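For concreteness, a standard way to model trajectory-pair preference feedback (a common instantiation, not necessarily the exact assumptions made in this paper; the link $\sigma$, score $r_\theta$, and trajectories $\tau^1,\tau^2$ below are illustrative notation) is through a monotone link applied to the difference of trajectory scores, e.g., a Bradley--Terry-style logistic link:
$$
\Pr\bigl(\tau^1 \succ \tau^2\bigr) = \sigma\bigl(r_\theta(\tau^1) - r_\theta(\tau^2)\bigr),
\qquad \sigma(x) = \frac{1}{1+e^{-x}},
\qquad r_\theta(\tau) = \sum_{h=1}^{H} r_\theta(s_h, a_h).
$$
Likewise, in the spirit of value-targeted regression (a sketch of the standard objective, with $\mathcal{P}$ the transition model class and $V^{k'}_{h+1}$ the value targets computed in earlier episodes), the transition estimate at episode $k$ would solve
$$
\hat{P}_k \in \arg\min_{P \in \mathcal{P}} \sum_{k' < k} \sum_{h=1}^{H}
\Bigl( \mathbb{E}_{s' \sim P(\cdot \mid s_h^{k'}, a_h^{k'})}\bigl[V^{k'}_{h+1}(s')\bigr] - V^{k'}_{h+1}\bigl(s_{h+1}^{k'}\bigr) \Bigr)^2,
$$
after which the exploratory policy is obtained by optimistic planning over a confidence set centered at $\hat{P}_k$.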