Identifying statistical regularities in solutions to some tasks in multi-task reinforcement learning can accelerate the learning of new tasks. Skill learning offers one way of identifying these regularities by decomposing pre-collected experiences into a sequence of skills. A popular approach to skill learning is maximizing the likelihood of the pre-collected experience with latent variable models, where the latent variables represent the skills. However, there are often many solutions that maximize the likelihood equally well, including degenerate solutions. To address this underspecification, we propose a new objective that combines the maximum likelihood objective with a penalty on the description length of the skills. This penalty incentivizes the skills to maximally extract common structures from the experiences. Empirically, our objective learns skills that solve downstream tasks with fewer samples than skills learned by maximizing likelihood alone. Further, while most prior works in the offline multi-task setting focus on tasks with low-dimensional observations, our objective can scale to challenging tasks with high-dimensional image observations.
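One way to write this combined objective, in assumed notation (the trajectories $\tau$, skill codes $z_{1:K}$, description-length term $\mathrm{DL}$, and trade-off weight $\beta$ are illustrative, not taken from the paper):

$$\min_\theta \; -\,\mathbb{E}_{\tau \sim \mathcal{D}}\big[\log p_\theta(\tau)\big] \;+\; \beta \,\mathbb{E}_{\tau \sim \mathcal{D}}\big[\mathrm{DL}(z_{1:K} \mid \tau)\big]$$

Here the first term is the standard maximum-likelihood fit of the latent variable model to the pre-collected experience $\mathcal{D}$, and the second term charges each trajectory for the bits needed to encode its skill sequence, so that among the many likelihood-equivalent solutions the objective prefers skills that compress, i.e., extract common structure from, the data.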