Recent efforts to learn reward functions from human feedback have tended to use deep neural networks, whose lack of transparency hampers our ability to explain agent behaviour or verify alignment. We explore the merits of learning intrinsically interpretable tree models instead. We develop a recently proposed method for learning reward trees from preference labels, and show it to be broadly competitive with neural networks on challenging high-dimensional tasks, with good robustness to limited or corrupted data. Having found that reward tree learning can be done effectively in complex settings, we then consider why it should be used, demonstrating that the interpretable reward structure gives significant scope for traceability, verification and explanation.