One of the most successful paradigms for reward learning uses human feedback in the form of comparisons. Although these methods hold promise, labeling comparisons is expensive and time-consuming for humans, which is a major bottleneck to their broader applicability. Our insight is that human time can be used far more effectively in these approaches by batching comparisons together, rather than having the human label each comparison individually. To do so, we leverage dimensionality-reduction and visualization techniques to provide the human with an interactive GUI displaying the state space, in which the user can label subportions of the state space. Across some simple MuJoCo tasks, we show that this high-level approach holds promise and greatly increases the performance of the resulting agents, given the same amount of human labeling time.
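To make the batching idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes PCA as the dimensionality-reduction step (the abstract does not name a specific technique), simulates the user's GUI selection with a simple split of the 2D embedding, and assumes that labeling one region as preferred over another yields one comparison for every cross-region pair of states. The function names `project_to_2d` and `batched_comparisons` are hypothetical.

```python
import numpy as np


def project_to_2d(states):
    """Project states onto their top two principal components (PCA via SVD).

    PCA is an assumption here; any 2D embedding could play the same role.
    """
    centered = states - states.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def batched_comparisons(preferred_mask, dispreferred_mask):
    """Expand two labeled regions into all (preferred, dispreferred) index pairs.

    Assumes a region-level label means every state in the preferred region
    is preferred to every state in the dispreferred region.
    """
    preferred = np.flatnonzero(preferred_mask)
    dispreferred = np.flatnonzero(dispreferred_mask)
    winners, losers = np.meshgrid(preferred, dispreferred, indexing="ij")
    return np.stack([winners.ravel(), losers.ravel()], axis=1)


# Synthetic stand-in for states collected from an environment.
rng = np.random.default_rng(0)
states = rng.normal(size=(200, 17))
embedding = project_to_2d(states)

# Stand-in for a user selection in the GUI: two regions of the 2D plot,
# split at the median of the first embedding coordinate.
median_x = np.median(embedding[:, 0])
good = embedding[:, 0] > median_x
bad = embedding[:, 0] <= median_x

# Two region labels expand into many pairwise comparisons at once.
pairs = batched_comparisons(good, bad)
```

Under these assumptions, two region-level labels produce `good.sum() * bad.sum()` comparisons, which is the sense in which a single interaction can replace many individual comparison labels.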