Humans show an innate ability to learn the regularities of the world through interaction. By performing experiments in our environment, we are able to discern the causal factors of variation and infer how they affect the dynamics of our world. Analogously, we attempt to equip reinforcement learning agents with the ability to perform experiments that facilitate a categorization of the rolled-out trajectories, and to subsequently infer the causal factors of the environment in a hierarchical manner. We introduce a novel intrinsic reward, called causal curiosity, and show that it allows our agents to learn optimal sequences of actions and to discover causal factors in the dynamics. The learned behavior allows the agent to infer a binary quantized representation of the ground-truth causal factors in every environment. Additionally, we find that these experimental behaviors are semantically meaningful (e.g., to differentiate between heavy and light blocks, our agents learn to lift them) and are learned in a self-supervised manner with approximately 2.5 times less data than conventional supervised planners. We show that these behaviors can be re-purposed and fine-tuned (e.g., from lifting to pushing or other downstream tasks). Finally, we show that the knowledge of causal factor representations aids zero-shot learning for more complex tasks.