Reinforcement learning can solve decision-making problems and train an agent to behave in an environment according to a predesigned reward function. However, this approach becomes problematic when the reward is too sparse and the agent never encounters it while exploring the environment. A possible solution is to equip the agent with intrinsic motivation, which provides informed exploration during which the agent is also more likely to encounter the external reward. Novelty detection is one of the promising branches of intrinsic motivation research. We present Self-supervised Network Distillation (SND), a class of intrinsic motivation algorithms that use the distillation error as a novelty indicator, where the target model is trained by self-supervised learning. We adapted three existing self-supervised methods for this purpose and tested them experimentally on a set of ten environments that are considered hard to explore. The results show that, for the same training time, our approach achieves faster growth and a higher external reward compared to the baseline models, which implies improved exploration in environments with very sparse rewards.
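To make the central mechanism concrete, the sketch below illustrates how a distillation error between a self-supervised target network and a predictor network can serve as an intrinsic (novelty) reward. This is a minimal, hedged illustration in PyTorch under assumed settings: the network sizes, the simple invariance loss standing in for the paper's self-supervised objectives, and helper names such as `intrinsic_reward` and `update` are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative setup: a target encoder trained with a self-supervised
# objective, and a predictor trained to imitate it. The per-state
# distillation error between the two serves as the novelty signal.
obs_dim, feat_dim = 64, 32

target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))

tgt_opt = torch.optim.Adam(target.parameters(), lr=1e-4)
pred_opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)


def intrinsic_reward(obs: torch.Tensor) -> torch.Tensor:
    """Per-state novelty reward: mean squared distillation error (no gradients)."""
    with torch.no_grad():
        return (target(obs) - predictor(obs)).pow(2).mean(dim=-1)


def update(obs: torch.Tensor, obs_augmented: torch.Tensor) -> None:
    # 1) Self-supervised update of the target: a simple invariance loss that
    #    pulls the features of two augmented views of the same observation
    #    together. This is only a stand-in for the self-supervised methods
    #    adapted in the paper.
    tgt_loss = F.mse_loss(target(obs), target(obs_augmented))
    tgt_opt.zero_grad()
    tgt_loss.backward()
    tgt_opt.step()

    # 2) Distillation update of the predictor toward the (detached) target,
    #    so that states the predictor has not yet fit produce a large error,
    #    i.e. a large intrinsic reward.
    pred_loss = F.mse_loss(predictor(obs), target(obs).detach())
    pred_opt.zero_grad()
    pred_loss.backward()
    pred_opt.step()
```

In this sketch, the intrinsic reward would typically be added to the (sparse) external reward when training the policy, so that novel states remain attractive even before any external reward has been observed.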