This paper formalises the problem of online algorithm selection in the context of Reinforcement Learning. The setup is as follows: given an episodic task and a finite set of off-policy RL algorithms, a meta-algorithm has to decide which RL algorithm is in control during the next episode so as to maximise the expected return. The article presents a novel meta-algorithm, called Epochal Stochastic Bandit Algorithm Selection (ESBAS). Its principle is to freeze the policy updates during each epoch and to leave a rebooted stochastic bandit in charge of the algorithm selection. Under some assumptions, a thorough theoretical analysis demonstrates its near-optimality with respect to the structural sampling-budget limitations. ESBAS is first empirically evaluated on a dialogue task, where it is shown to outperform each individual algorithm in most configurations. ESBAS is then adapted to a true online setting, where algorithms update their policies after each transition; we call this variant SSBAS. SSBAS is evaluated on a fruit collection task, where it is shown to adapt the stepsize parameter more efficiently than the classical hyperbolic decay, and on an Atari game, where it improves performance by a wide margin.
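To make the epoch-based principle concrete, here is a minimal, hypothetical Python sketch of the ESBAS loop. It assumes that epoch β lasts 2^β episodes and that each candidate algorithm exposes `observe` and `update_policy` methods; the `run_episode` helper and these interfaces are illustrative assumptions, not the paper's API. A fresh UCB bandit is rebooted at every epoch, policies stay frozen within the epoch, and all algorithms learn off-policy from the collected trajectories at the epoch boundary.

```python
import math

def ucb_select(counts, sums, t, c=math.sqrt(2)):
    """Return the arm with the highest UCB index; play each arm once first."""
    for k, n in enumerate(counts):
        if n == 0:
            return k
    return max(range(len(counts)),
               key=lambda k: sums[k] / counts[k]
               + c * math.sqrt(math.log(t) / counts[k]))

def esbas(algorithms, run_episode, num_epochs):
    """Illustrative ESBAS loop (assumed interfaces, not the paper's code).

    Within epoch beta (lasting 2**beta episodes here), every candidate
    algorithm's policy is frozen and a freshly rebooted stochastic bandit
    picks which algorithm controls each episode. All algorithms store the
    trajectories off-policy and only update their policies between epochs.
    """
    K = len(algorithms)
    for beta in range(num_epochs):
        counts, sums = [0] * K, [0.0] * K          # rebooted bandit statistics
        for t in range(1, 2 ** beta + 1):
            k = ucb_select(counts, sums, t)        # bandit chooses the controller
            trajectory, episode_return = run_episode(algorithms[k])
            counts[k] += 1
            sums[k] += episode_return              # bandit reward = episodic return
            for alg in algorithms:
                alg.observe(trajectory)            # off-policy data sharing
        for alg in algorithms:
            alg.update_policy()                    # policy updates only at epoch end
```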