We address the issue of tuning hyperparameters (HPs) for imitation learning algorithms in the context of continuous control, when the underlying reward function of the demonstrating expert cannot be observed at any time. The vast literature on imitation learning mostly considers this reward function to be available for HP selection, but this is not a realistic setting. Indeed, were this reward function available, it could be used directly for policy training, and imitation would not be necessary. To tackle this largely ignored problem, we propose a number of possible proxies for the external reward. We evaluate them in an extensive empirical study (more than 10'000 agents across 9 environments) and make practical recommendations for selecting HPs. Our results show that while imitation learning algorithms are sensitive to HP choices, it is often possible to select sufficiently good HPs through a proxy for the reward function.