Specifying reward functions for complex tasks like object manipulation or driving is challenging to do by hand. Reward learning seeks to address this by learning a reward model using human feedback on selected query policies. This shifts the burden of reward specification to the optimal design of the queries. We propose a theoretical framework for studying reward learning and the associated optimal experiment design problem. Our framework models rewards and policies as nonparametric functions belonging to subsets of Reproducing Kernel Hilbert Spaces (RKHSs). The learner receives (noisy) oracle access to a true reward and must output a policy that performs well under the true reward. For this setting, we first derive non-asymptotic excess risk bounds for a simple plug-in estimator based on ridge regression. We then solve the query design problem by optimizing these risk bounds with respect to the choice of query set, obtaining a finite-sample statistical rate that depends primarily on the eigenvalue spectrum of a certain linear operator on the RKHSs. Despite the generality of these results, our bounds are stronger than previous bounds developed for more specialized problems. We specifically show that the well-studied problem of Gaussian process (GP) bandit optimization is a special case of our framework, and that our bounds either improve upon or are competitive with known regret guarantees for the Matérn kernel.
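As a concrete illustration of the plug-in approach described above, the minimal sketch below fits a kernel ridge regression reward model from noisy oracle feedback on a small set of queried policies and then returns the candidate policy that maximizes the estimated reward. The setup is assumed, not the paper's: the policy featurization, the random choice of queries, the Matérn hyperparameters, and the regularization constant `lam` are all hypothetical placeholders, and the paper's actual contribution is choosing the query set by optimizing its excess risk bounds rather than at random.

```python
# Illustrative sketch only (assumed setup, not the paper's code): a kernel
# ridge regression plug-in estimator for reward learning over a finite
# candidate policy set, using a Matérn kernel.
import numpy as np
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# Hypothetical representation: each candidate policy is a feature vector,
# and the oracle returns noisy evaluations of the (unknown) true reward.
candidate_policies = rng.uniform(-1.0, 1.0, size=(200, 3))

def true_reward(x):
    # Unknown to the learner; used only to simulate the noisy oracle.
    return np.sin(3.0 * x[:, 0]) - x[:, 1] ** 2

# Query design placeholder: a random subset of policies is queried here;
# the paper instead optimizes the query set via its risk bounds.
query_idx = rng.choice(len(candidate_policies), size=30, replace=False)
queries = candidate_policies[query_idx]
feedback = true_reward(queries) + 0.1 * rng.standard_normal(len(queries))

# Plug-in estimator: kernel ridge regression in the Matérn RKHS,
#   r_hat(x) = k(x, Q) (K + lam * I)^{-1} y.
kernel = Matern(length_scale=0.5, nu=2.5)
lam = 1e-2
K = kernel(queries)                                   # Gram matrix on queries
coef = np.linalg.solve(K + lam * np.eye(len(queries)), feedback)
estimated_rewards = kernel(candidate_policies, queries) @ coef

# Output the policy that performs best under the estimated reward.
best_policy = candidate_policies[np.argmax(estimated_rewards)]
print("selected policy features:", best_policy)
```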