Animals exhibit an innate ability to learn regularities of the world through interaction. By performing experiments in their environment, they are able to discern the causal factors of variation and infer how they affect the world's dynamics. Inspired by this, we attempt to equip reinforcement learning agents with the ability to perform experiments that facilitate a categorization of the rolled-out trajectories, and to subsequently infer the causal factors of the environment in a hierarchical manner. We introduce {\em causal curiosity}, a novel intrinsic reward, and show that it allows our agents to learn optimal sequences of actions and to discover causal factors in the dynamics of the environment. The learned behavior allows the agents to infer a binary quantized representation of the ground-truth causal factors in every environment. Additionally, we find that these experimental behaviors are semantically meaningful (e.g., our agents learn to lift blocks to categorize them by weight), and are learned in a self-supervised manner with approximately 2.5 times less data than conventional supervised planners. We show that these behaviors can be re-purposed and fine-tuned (e.g., from lifting to pushing or other downstream tasks). Finally, we show that knowledge of the causal factor representations aids zero-shot learning for more complex tasks. Visit https://sites.google.com/usc.edu/causal-curiosity/home for the project website.
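The abstract does not spell out how the intrinsic reward is computed, but its core idea, rewarding action sequences whose rollouts split environments into two groups along a hidden causal factor, can be illustrated concretely. Below is a minimal, hypothetical sketch assuming the reward is a two-way clustering separability score over trajectories; the toy `rollout` environment, the `mass` parameter, and the function names are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of a "causal curiosity"-style intrinsic reward.
# ASSUMPTION: the reward scores how cleanly one action sequence splits
# trajectories from different environments into two clusters (a binary
# quantization of a hidden causal factor such as block mass). All names
# and the toy dynamics below are hypothetical.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def rollout(env_params, actions, rng):
    """Toy stand-in for an environment rollout: a 'heavy' block barely
    moves when lifted, while a 'light' one moves a lot (plus noise)."""
    mass = env_params["mass"]
    # Trajectory = block height after each action, scaled by 1/mass.
    return np.cumsum(actions) / mass + 0.01 * rng.standard_normal(len(actions))

def causal_curiosity_reward(actions, env_param_list, rng):
    """Intrinsic reward: two-way clustering separability of the
    trajectories this action sequence produces across environments."""
    trajs = np.stack([rollout(p, actions, rng) for p in env_param_list])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(trajs)
    # Silhouette in [-1, 1]: higher means this "experiment" separates
    # the environments more cleanly by the hidden factor.
    return silhouette_score(trajs, labels), labels

rng = np.random.default_rng(0)
envs = [{"mass": m} for m in (1.0, 1.1, 5.0, 5.2)]  # two hidden regimes
lift = np.ones(10)    # an informative experiment: keep lifting
idle = np.zeros(10)   # an uninformative experiment: do nothing
for name, acts in [("lift", lift), ("idle", idle)]:
    reward, labels = causal_curiosity_reward(acts, envs, rng)
    print(f"{name}: reward={reward:.3f}, binary factor labels={labels}")
```

In this toy setting, the "lift" action sequence earns a higher reward than "idle" because lifting reveals the mass regime of each environment, and the cluster labels serve as the binary quantized representation of the causal factor described above.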