We study the problem of planning under model uncertainty in an online meta-reinforcement learning (RL) setting, where an agent is presented with a sequence of related tasks with limited interactions per task. The agent can use its experience within each task and across tasks to estimate both the transition model and the distribution over tasks. We propose an algorithm that meta-learns the underlying structure across tasks, uses it to plan in each task, and upper-bounds the regret of the planning loss. Our bound suggests that the average regret over tasks decreases as the number of tasks increases and as the tasks become more similar. In the classical single-task setting, it is known that the planning horizon should depend on the estimated model's accuracy, that is, on the number of samples within the task. We generalize this finding to meta-RL and study how the planning horizon depends on the number of tasks. Based on our theoretical findings, we derive heuristics for selecting slowly increasing discount factors, and we validate their significance empirically.
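As a purely illustrative sketch (not the heuristic derived in this work), the snippet below shows the kind of schedule the abstract alludes to: a guidance discount factor that grows slowly toward the evaluation discount factor as more tasks, and hence more samples for the meta-learned model, are observed. The function name `guidance_discount`, the inverse-square-root form, and the constant `c` are assumptions made only for illustration.

```python
import numpy as np

def guidance_discount(num_tasks: int, gamma_eval: float = 0.99, c: float = 1.0) -> float:
    """Hypothetical schedule: a planning (guidance) discount factor that
    increases slowly toward the evaluation discount factor gamma_eval as
    the number of observed tasks grows.  The inverse-square-root rate and
    the constant c are illustrative assumptions, not the paper's result."""
    # With few tasks the meta-learned model is inaccurate, so plan with a
    # shorter horizon (smaller discount); with many tasks, plan with a
    # horizon closer to the evaluation horizon.
    return gamma_eval * (1.0 - c / np.sqrt(num_tasks + 1.0))

# Example: the effective planning horizon 1 / (1 - gamma) lengthens as tasks accumulate.
for t in [1, 10, 100, 1000]:
    g = guidance_discount(t)
    print(f"tasks={t:5d}  gamma={g:.3f}  effective horizon={1.0 / (1.0 - g):.1f}")
```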