We develop a mathematical framework for solving multi-task reinforcement learning (MTRL) problems based on a type of policy gradient method. The goal in MTRL is to learn a common policy that operates effectively in different environments; these environments have similar (or overlapping) state spaces but different rewards and dynamics. We highlight two fundamental challenges in MTRL that are not present in its single-task counterpart, and illustrate them with simple examples. We then develop a decentralized entropy-regularized policy gradient method for solving the MTRL problem, and study its finite-time convergence rate. We demonstrate the effectiveness of the proposed method using a series of numerical experiments. These experiments range from small-scale "GridWorld" problems, which readily illustrate the trade-offs involved in multi-task learning, to large-scale problems in which common policies are learned to navigate an airborne drone in multiple (simulated) environments.