In the sequential decision making setting, an agent aims to achieve systematic generalization over a large, possibly infinite, set of environments. Such environments are modeled as discrete Markov decision processes with both states and actions represented through a feature vector. The underlying structure of the environments allows the transition dynamics to be factored into two components: one that is environment-specific and another one that is shared. Consider a set of environments that share the laws of motion as an illustrative example. In this setting, the agent can take a finite amount of reward-free interactions from a subset of these environments. The agent then must be able to approximately solve any planning task defined over any environment in the original set, relying on the above interactions only. Can we design a provably efficient algorithm that achieves this ambitious goal of systematic generalization? In this paper, we give a partially positive answer to this question. First, we provide the first tractable formulation of systematic generalization by employing a causal viewpoint. Then, under specific structural assumptions, we provide a simple learning algorithm that allows us to guarantee any desired planning error up to an unavoidable sub-optimality term, while showcasing a polynomial sample complexity.
翻译:在顺序决策设置中,一个代理商的目的是对大量、可能无限的环境进行系统化的概括化。这种环境以离散的马尔科夫决定过程为模型,同时同时以特性矢量代表状态和行动。环境的基本结构允许将过渡动态分为两个部分:一个是环境特有的,另一个是共享的。将一组环境作为说明性的例子来考虑。在这个环境中,代理商可以从这些环境的一组环境中采取一定数量的无酬互动。然后,代理商必须能够大致解决最初设置的任何环境所定义的任何规划任务,仅依靠上述交互作用。我们能否设计出一种可实现系统化概括这一宏大目标的可实现的高效算法?在本文件中,我们对这个问题给出了部分肯定的答案。首先,我们通过采用因果关系观点来提供第一种系统化的概括化的可感动的公式。然后,在具体的结构假设下,我们提供一种简单的学习算法,使我们能够保证任何预期的规划错误达到不可避免的亚最佳性术语,同时展示一个多元性样本的复杂性。