As humans and animals learn in the natural world, they encounter distributions of entities, situations and events that are far from uniform. Typically, a relatively small set of experiences is encountered frequently, while many important experiences occur only rarely. The highly skewed, heavy-tailed nature of reality poses particular learning challenges that humans and animals have met by evolving specialised memory systems. By contrast, most popular RL environments and benchmarks involve approximately uniform variation of properties, objects, situations or tasks. How will RL algorithms perform in worlds (like ours) where the distribution of environment features is far less uniform? To explore this question, we develop three complementary RL environments where the agent's experience varies according to a Zipfian (discrete power law) distribution. On these benchmarks, we find that standard Deep RL architectures and algorithms acquire useful knowledge of common situations and tasks, but fail to adequately learn about rarer ones. To understand this failure better, we explore how different aspects of current approaches may be adjusted to help improve performance on rare events, and show that the RL objective function, the agent's memory system and self-supervised learning objectives can all influence an agent's ability to learn from uncommon experiences. Together, these results show that learning robustly from skewed experience is a critical challenge for applying Deep RL methods beyond simulations or laboratories, and our Zipfian environments provide a basis for measuring future progress towards this goal.
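For concreteness, a Zipfian distribution over n items assigns the item of rank k a probability proportional to 1/k^α, where the exponent α > 0 controls how skewed the distribution is. The sketch below illustrates sampling an agent's per-episode experiences under such a distribution; the values n_items = 20 and α = 1.0 are illustrative assumptions, not parameters taken from the environments described above.

```python
import numpy as np

def zipfian_probs(n_items: int, alpha: float = 1.0) -> np.ndarray:
    """Probability of each rank k = 1..n under P(k) ∝ 1 / k**alpha."""
    weights = 1.0 / np.arange(1, n_items + 1) ** alpha
    return weights / weights.sum()

# Sample which object/situation/task the agent encounters each episode.
rng = np.random.default_rng(0)
probs = zipfian_probs(n_items=20, alpha=1.0)
episodes = rng.choice(20, size=10_000, p=probs)

# Head items dominate the agent's experience; tail items appear rarely.
counts = np.bincount(episodes, minlength=20)
print(counts[:3], counts[-3:])
```

With α = 1.0 the most common item is sampled roughly 20 times more often than the rarest, and larger α makes the tail rarer still, which is what makes learning about uncommon experiences difficult for a uniform-sampling learner.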