Show me the Way: Intrinsic Motivation from Demonstrations

The study of exploration in decision making has a long history but remains actively debated. From the vast literature that has addressed this topic for decades from various points of view (e.g., developmental psychology, experimental design, artificial intelligence), intrinsic motivation has emerged as a concept that can be transferred to artificial agents in practice. In particular, in the recent field of deep reinforcement learning (RL), agents implement this concept (mainly through a novelty argument) as an exploration bonus, added to the task reward, that encourages visiting the whole environment. This approach is supported by the large body of RL theory in which convergence to optimality assumes exhaustive exploration. Yet humans and other mammals do not explore the world exhaustively, and their motivation is based not only on novelty but also on various other factors (e.g., curiosity, fun, style, pleasure, safety, competition). They optimize for life-long learning and train to learn transferable skills in playgrounds without obvious goals. They also apply innate or learned priors to save time and stay safe. For these reasons, we propose to learn an exploration bonus from demonstrations, which could transfer these motivations to an artificial agent with few assumptions about their rationale. Using an inverse RL approach, we show that complex exploration behaviors, reflecting different motivations, can be learned and used efficiently by RL agents to solve tasks for which exhaustive exploration is prohibitive.
