A key issue in imitation learning is making a policy learned from limited samples generalize well across the whole state-action space. The problem is much more severe in environments with high-dimensional states, such as game playing from raw pixel inputs. In this setting, even state-of-the-art adversary-based imitation learning algorithms fail. Through empirical studies, we find that the main cause is the failure to train a discriminator powerful enough to generate meaningful rewards in high-dimensional environments. Although dimensionality reduction might seem to help, a straightforward application of off-the-shelf methods does not achieve good performance. In this work, we show in theory that a balance between dimensionality reduction and discriminative training is essential for effective learning. To achieve this balance, we propose HashReward, which uses the idea of supervised hashing. Experimental results show that HashReward can outperform state-of-the-art methods by a large margin in challenging high-dimensional environments.