In this paper we examine the problem of determining demonstration sufficiency for AI agents that learn from demonstrations: how can an AI agent self-assess whether it has received enough demonstrations from an expert to ensure a desired level of performance? To address this problem we propose a novel self-assessment approach based on Bayesian inverse reinforcement learning and value-at-risk to enable agents that learn from demonstrations to compute high-confidence bounds on their performance and use these bounds to determine when they have a sufficient number of demonstrations. We propose and evaluate two definitions of sufficiency: (1) normalized expected value difference, which measures regret with respect to the expert's unobserved reward function, and (2) improvement over a baseline policy. We demonstrate how to formulate high-confidence bounds on both of these metrics. We evaluate our approach in simulation and demonstrate the feasibility of developing an AI system that can accurately evaluate whether it has received sufficient training data to guarantee, with high confidence, that it can match an expert's performance or surpass the performance of a baseline policy within some desired safety threshold.
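As a rough illustration of the self-assessment test described above, the sketch below assumes we already have samples of the expert's reward function drawn from a Bayesian IRL posterior (e.g., via MCMC), together with routines that evaluate a policy's expected value under a given reward. It computes the value-at-risk (the α-quantile) of the normalized expected value difference over those samples and declares the demonstrations sufficient when that high-confidence bound falls below a chosen threshold ε. The function and argument names (posterior_rewards, optimal_value, policy_value, etc.) are hypothetical placeholders, not the paper's implementation.

```python
import numpy as np

def demonstration_sufficiency(
    posterior_rewards,   # reward samples from the Bayesian IRL posterior (assumed given)
    eval_policy,         # the learner's current policy (e.g., optimal under the MAP reward)
    optimal_value,       # optimal_value(reward) -> expected value of an optimal policy under `reward`
    policy_value,        # policy_value(policy, reward) -> expected value of `policy` under `reward`
    alpha=0.95,          # confidence level for the value-at-risk bound
    epsilon=0.1,         # desired threshold on normalized regret
):
    """Return (sufficient, var_bound): sufficient is True when the alpha-VaR
    of the normalized expected value difference is below epsilon."""
    nevds = []
    for r in posterior_rewards:
        v_star = optimal_value(r)
        v_pi = policy_value(eval_policy, r)
        # Normalized expected value difference (regret under reward sample r),
        # guarding against division by a near-zero optimal value.
        nevds.append((v_star - v_pi) / max(abs(v_star), 1e-8))
    # alpha-quantile of the regret distribution: with posterior probability
    # at least alpha, the true normalized regret is no larger than this bound.
    var_bound = float(np.quantile(nevds, alpha))
    return var_bound < epsilon, var_bound
```

The same test applies to the second sufficiency definition by replacing the normalized regret with the (signed) value improvement over a baseline policy and checking that the bound indicates improvement with the desired confidence.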