In the sequential decision making setting, an agent aims to achieve systematic generalization over a large, possibly infinite, set of environments. Such environments are modeled as discrete Markov decision processes with both states and actions represented through a feature vector. The underlying structure of the environments allows the transition dynamics to be factored into two components: one that is environment-specific and another that is shared. Consider a set of environments that share the laws of motion as an example. In this setting, the agent can take a finite amount of reward-free interactions from a subset of these environments. The agent then must be able to approximately solve any planning task defined over any environment in the original set, relying on the above interactions only. Can we design a provably efficient algorithm that achieves this ambitious goal of systematic generalization? In this paper, we give a partially positive answer to this question. First, we provide a tractable formulation of systematic generalization by employing a causal viewpoint. Then, under specific structural assumptions, we provide a simple learning algorithm that guarantees any desired planning error up to an unavoidable sub-optimality term, while showcasing a polynomial sample complexity.
翻译:在顺序决策设置中,一个代理商的目标是在大量、可能无限的环境组合中实现系统性的概括化。这种环境以离散的马尔科夫决策程序为模范,同时通过特性矢量代表状态和行动。环境的基本结构允许将过渡动态纳入两个组成部分:一个是环境特有的,另一个是共享的。将一组环境作为范例,分享运动法则。在这个环境中,代理商可以从这些环境的一组环境中采取一定数量的无酬互动。然后,代理商必须能够大致解决最初设置的任何环境所定义的任何规划任务,仅依靠上述相互作用。我们能否设计出一种可实现系统化概括化这一宏伟目标的有效算法?在本文中,我们对这个问题给出了部分肯定的答案。首先,我们通过采用因果关系观点来提供一套系统化的概括化的可移植公式。然后,在具体的结构假设下,我们提供一种简单的学习算法,保证任何理想的规划错误达到不可避免的亚精度术语,同时展示多元抽样的复杂度。