In adversarial imitation learning algorithms (AILs), no true rewards are obtained from the environment for learning the policy; instead, pseudo rewards derived from the discriminator's output are required. Given the implicit reward bias problem in AILs, we design several representative reward function shapes and compare their performance through large-scale experiments. To ensure the reliability of our results, we conduct experiments on a series of MuJoCo and Box2D continuous control tasks based on four different AILs. In addition, we compare the performance of the various reward function shapes under varying numbers of expert trajectories. The empirical results reveal that the positive logarithmic reward function works well on typical continuous control tasks, whereas the so-called unbiased reward function is limited to specific kinds of tasks. Furthermore, several of our designed reward functions also perform excellently in these environments.
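To make the notion of "reward function shape" concrete, the following is a minimal sketch of common discriminator-based pseudo-reward shapes discussed in the AIL literature (GAIL- and AIRL-style forms). The function name, argument names, and the exact formulas labeled "positive_log" and "unbiased" are illustrative assumptions for this sketch and may differ from the precise definitions studied in the paper.

```python
import numpy as np

def pseudo_rewards(d_out, shape="positive_log", eps=1e-8):
    """Illustrative discriminator-based pseudo-reward shapes (assumed forms).

    d_out: discriminator output D(s, a) in (0, 1), interpreted as the
           probability that (s, a) comes from the expert.
    """
    d = np.clip(d_out, eps, 1.0 - eps)
    if shape == "positive_log":
        # GAIL-style positive logarithmic reward: always >= 0,
        # often associated with a survival bias.
        return -np.log(1.0 - d)
    if shape == "negative_log":
        # Always <= 0, often associated with a termination bias.
        return np.log(d)
    if shape == "unbiased":
        # AIRL-style log-odds reward: can take either sign.
        return np.log(d) - np.log(1.0 - d)
    raise ValueError(f"unknown reward shape: {shape}")

# Example: pseudo rewards for a batch of discriminator outputs.
print(pseudo_rewards(np.array([0.2, 0.5, 0.9]), shape="unbiased"))
```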