The discovery of reusable sub-routines simplifies decision-making and planning in complex reinforcement learning problems. Previous approaches propose to learn such temporal abstractions in a purely unsupervised fashion by observing state-action trajectories gathered from executing a policy. However, a current limitation is that they process each trajectory in an entirely sequential manner, which prevents them from revising earlier decisions about sub-routine boundary points in light of new incoming information. In this work we propose SloTTAr, a fully parallel approach that integrates sequence processing Transformers with a Slot Attention module and adaptive computation for learning the number of such sub-routines in an unsupervised fashion. We demonstrate how SloTTAr is capable of outperforming strong baselines in terms of boundary point discovery, even for sequences containing variable numbers of sub-routines, while being up to 7x faster to train on existing benchmarks.
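To make the architectural idea concrete, the following is a minimal illustrative sketch (not the authors' implementation) of the combination the abstract describes: a Transformer encoder produces per-timestep features for a state-action trajectory, and a Slot Attention module groups those timesteps into a small set of slots whose attention masks indicate candidate sub-routine segments. All hyperparameters, tensor shapes, and the fixed slot count are assumptions for illustration; adaptive computation over the number of slots is omitted.

```python
# Illustrative sketch only: Slot Attention over Transformer-encoded trajectory
# features. Shapes, dimensions, and the fixed number of slots are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    def __init__(self, num_slots, dim, iters=3, eps=1e-8):
        super().__init__()
        self.num_slots, self.iters, self.eps = num_slots, iters, eps
        self.scale = dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm_inputs = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)

    def forward(self, inputs):
        # inputs: (batch, seq_len, dim) features, one vector per timestep
        b, n, d = inputs.shape
        inputs = self.norm_inputs(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        # sample initial slots from a learned Gaussian
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
            b, self.num_slots, d, device=inputs.device)
        for _ in range(self.iters):
            slots_prev = slots
            q = self.to_q(self.norm_slots(slots))
            # softmax over the slot axis: slots compete to explain each timestep
            attn = F.softmax(torch.einsum('bkd,bnd->bkn', q, k) * self.scale, dim=1)
            attn = attn / (attn.sum(dim=-1, keepdim=True) + self.eps)
            updates = torch.einsum('bkn,bnd->bkd', attn, v)
            slots = self.gru(updates.reshape(-1, d),
                             slots_prev.reshape(-1, d)).reshape(b, -1, d)
            slots = slots + self.mlp(self.norm_mlp(slots))
        # attn: soft assignment of timesteps to slots (candidate sub-routines)
        return slots, attn

# usage sketch: encode a trajectory in parallel, then group timesteps into slots
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
traj = torch.randn(8, 50, 64)          # 8 trajectories, 50 steps, 64-dim state-action features
slots, attn = SlotAttention(num_slots=4, dim=64)(encoder(traj))
print(slots.shape, attn.shape)         # (8, 4, 64), (8, 4, 50)
```

Because the whole trajectory is encoded and assigned to slots at once, every timestep's assignment can depend on information from anywhere in the sequence, in contrast to purely sequential boundary prediction.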