Exploration and credit assignment under sparse rewards remain challenging problems. We argue that these challenges arise in part from the intrinsic rigidity of operating at the level of actions. Actions can precisely define how to perform an activity but are ill-suited to describing what activity to perform. Causal effects, in contrast, are inherently composable and temporally abstract, making them ideal for descriptive tasks. By leveraging a hierarchy of causal effects, this study aims to expedite the learning of task-specific behavior and to aid exploration. Borrowing counterfactual and normality measures from the causal literature, we disentangle controllable effects from effects caused by other dynamics of the environment. We propose CEHRL, a hierarchical method that models the distribution of controllable effects using a Variational Autoencoder. A high-level policy uses this distribution to 1) explore the environment via random effect exploration, so that novel effects are continuously discovered and learned, and 2) learn task-specific behavior by prioritizing the effects that maximize a given reward function. Experimental results show that random effect exploration is more efficient than exploring with random actions, and that by assigning credit to a few effects rather than many actions, CEHRL learns tasks more rapidly.
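To make the idea of disentangling controllable effects from other environment dynamics concrete, the following is a minimal sketch and not the CEHRL implementation. It assumes a hypothetical deterministic toy environment whose state is `[agent_pos, clock]`, and it assumes one simple counterfactual reading: the controllable effect of an action is its total effect minus the mean effect of the alternative actions. The names `ACTIONS`, `step`, `total_effect`, and `controllable_effect` are illustrative only.

```python
import numpy as np

# Hypothetical action set for the toy example: move left, stay, move right.
ACTIONS = (-1, 0, 1)


def step(state, action):
    """Toy deterministic dynamics (an assumption for illustration).

    The agent position changes with the action; the clock advances by one
    on every step regardless of what the agent does.
    """
    pos, clock = state
    return np.array([pos + action, clock + 1])


def total_effect(state, action):
    """Change in state observed after taking `action`."""
    return step(state, action) - state


def controllable_effect(state, action):
    """Total effect minus the mean effect of the alternative actions.

    Components that change identically under every action (here, the clock)
    cancel out, leaving only what the chosen action itself brought about.
    """
    others = [total_effect(state, a) for a in ACTIONS if a != action]
    return total_effect(state, action) - np.mean(others, axis=0)


state = np.array([3, 10])
print(total_effect(state, 1))         # includes the clock tick
print(controllable_effect(state, 1))  # clock component cancels to zero
```

In this toy setting the clock tick appears in the total effect of every action, so it is removed by the counterfactual comparison, while the change in agent position survives. A learned model of such controllable effects (a Variational Autoencoder in the abstract above) would take the place of the hand-written `step` function in any realistic setting.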