In the sequential decision-making setting, an agent aims to achieve systematic generalization over a large, possibly infinite, set of environments. Such environments are modeled as discrete Markov decision processes with both states and actions represented through feature vectors. The underlying structure of the environments allows the transition dynamics to be factored into two components: one that is environment-specific and one that is shared across environments. As an example, consider a set of environments that share the laws of motion. In this setting, the agent can take a finite number of reward-free interactions from a subset of these environments. The agent must then be able to approximately solve any planning task defined over any environment in the original set, relying on the above interactions only. Can we design a provably efficient algorithm that achieves this ambitious goal of systematic generalization? In this paper, we give a partially positive answer to this question. First, we provide a tractable formulation of systematic generalization by employing a causal viewpoint. Then, under specific structural assumptions, we provide a simple learning algorithm that guarantees any desired planning error up to an unavoidable sub-optimality term, while showcasing a polynomial sample complexity.