We study the problem of how to construct a set of policies that can be composed together to solve a collection of reinforcement learning tasks. Each task is a different reward function defined as a linear combination of known features. We consider a specific class of policy compositions which we call set improving policies (SIPs): given a set of policies and a set of tasks, a SIP is any composition of the former whose performance is at least as good as that of its constituents across all the tasks. We focus on the most conservative instantiation of SIPs, set-max policies (SMPs), so our analysis extends to any SIP. This includes known policy-composition operators like generalized policy improvement. Our main contribution is a policy iteration algorithm that builds a set of policies in order to maximize the worst-case performance of the resulting SMP on the set of tasks. The algorithm works by successively adding new policies to the set. We show that the worst-case performance of the resulting SMP strictly improves at each iteration, and the algorithm only stops when there does not exist a policy that leads to improved performance. We empirically evaluate our algorithm on a grid world and also on a set of domains from the DeepMind control suite. We confirm our theoretical results regarding the monotonically improving performance of our algorithm. Interestingly, we also show empirically that the sets of policies computed by the algorithm are diverse, leading to different trajectories in the grid world and very distinct locomotion skills in the control suite.
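To make the iterative scheme described above concrete, here is a minimal sketch under simplifying assumptions not spelled out in the abstract: tasks are weight vectors w over the d known features, each policy is summarized by its successor features psi so that its value on task w is the dot product psi @ w, the worst case is taken over a finite candidate set of tasks (the paper's setting may instead optimize over a continuous task set), and `solve_task(w)` is a hypothetical oracle returning the successor features of a policy (approximately) optimal for the reward defined by w. The function and variable names are illustrative, not the authors' API.

```python
# Sketch of the worst-case set-building loop: evaluate the set-max policy (SMP),
# find the task where it performs worst, and add a policy specialized to that task.
import numpy as np

def smp_value(psis, w):
    """Value of the SMP on task w: the best constituent policy is executed."""
    return max(psi @ w for psi in psis)

def worst_case_value(psis, tasks):
    """Worst-case SMP performance over a finite set of candidate tasks."""
    return min(smp_value(psis, w) for w in tasks)

def build_policy_set(tasks, solve_task, num_iterations=10, tol=1e-6):
    """Successively add policies that improve the worst-case SMP performance."""
    # Initialize the set with a policy for an arbitrary task (here: the first one).
    psis = [solve_task(tasks[0])]
    for _ in range(num_iterations):
        # Identify the task on which the current SMP performs worst.
        w_adv = min(tasks, key=lambda w: smp_value(psis, w))
        # Compute a policy specialized to that adversarial task.
        psi_new = solve_task(w_adv)
        # Stop when no new policy improves performance on the worst-case task.
        if smp_value(psis + [psi_new], w_adv) <= smp_value(psis, w_adv) + tol:
            break
        psis.append(psi_new)
    return psis
```

By construction, each accepted iteration strictly increases `worst_case_value(psis, tasks)`, mirroring the monotone-improvement guarantee stated in the abstract.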