Efficient exploration is a long-standing problem in reinforcement learning, since extrinsic rewards are usually sparse or missing. A popular solution is to provide the agent with novelty signals as intrinsic rewards. In this work, we introduce SEMI, a self-supervised exploration policy that incentivizes the agent to maximize a new novelty signal: multisensory incongruity, which is measured in two aspects, perception incongruity and action incongruity. The former represents the misalignment of the multisensory inputs, while the latter represents the variance of the agent's policy under different sensory inputs. Specifically, an alignment predictor is learned to detect whether multiple sensory inputs are aligned, and its prediction error is used to measure perception incongruity. A policy model takes different combinations of the multisensory observations as input and outputs actions for exploration; the variance of these actions is used to measure action incongruity. Using both incongruities as intrinsic rewards, SEMI allows an agent to learn skills by exploring in a self-supervised manner without any external rewards. We further show that SEMI is compatible with extrinsic rewards and improves the sample efficiency of policy learning. The effectiveness of SEMI is demonstrated across a variety of benchmark environments, including object manipulation and audio-visual games.
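To make the two incongruity terms concrete, below is a minimal sketch of how the intrinsic reward could be computed, assuming a visual-audio setting, a learned `align_predictor` that scores whether two modalities come from the same moment, and a `policy` network that accepts both modalities. The function names, signatures, and the use of binary cross-entropy for the alignment error are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def intrinsic_reward(vision, audio, align_predictor, policy, is_aligned):
    """Sketch of a SEMI-style intrinsic reward: perception + action incongruity.

    align_predictor and policy are hypothetical modules assumed to take
    (vision, audio) batches and return a logit / an action vector, respectively.
    """
    # Perception incongruity: error of the alignment predictor on the current
    # multisensory observation pair (is_aligned = True if modalities match).
    align_logit = align_predictor(vision, audio)                  # (batch, 1)
    target = torch.full_like(align_logit, float(is_aligned))
    r_perception = F.binary_cross_entropy_with_logits(
        align_logit, target, reduction="none").squeeze(-1)        # (batch,)

    # Action incongruity: variance of the policy's actions when it is fed
    # different combinations of the sensory inputs (both, vision only, audio only).
    zeros_v, zeros_a = torch.zeros_like(vision), torch.zeros_like(audio)
    actions = torch.stack([
        policy(vision, audio),      # both modalities
        policy(vision, zeros_a),    # vision only
        policy(zeros_v, audio),     # audio only
    ], dim=0)                       # (3, batch, action_dim)
    r_action = actions.var(dim=0).mean(dim=-1)                    # (batch,)

    return r_perception + r_action
```

In this reading, both terms vanish for observations the agent already models well (aligned inputs are recognized and the policy is insensitive to dropping a modality), so the reward steers exploration toward states where the multisensory model is still uncertain.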