In this paper, we study the Tiered Reinforcement Learning setting, a parallel transfer learning framework in which the goal is to transfer knowledge from a low-tier (source) task to a high-tier (target) task in order to reduce the exploration risk of the latter while solving the two tasks in parallel. Unlike previous work, we do not assume that the low-tier and high-tier tasks share the same dynamics or reward functions, and we focus on robust knowledge transfer without prior knowledge of the task similarity. We identify a natural and necessary condition for our objective, which we call "Optimal Value Dominance". Under this condition, we propose novel online learning algorithms such that, for the high-tier task, they achieve constant regret on a subset of states depending on the task similarity and retain near-optimal regret when the two tasks are dissimilar, while for the low-tier task, they remain near-optimal without making any sacrifice. Moreover, we further study the setting with multiple low-tier tasks and propose a novel transfer-source selection mechanism, which can ensemble the information from all low-tier tasks and yield provable benefits on a much larger state-action space.