Switching costs, which capture the cost of changing policies, are regarded as a critical metric in reinforcement learning (RL), in addition to the standard metric of losses (or rewards). However, existing studies on switching costs (with a coefficient $\beta$ that is strictly positive and independent of $T$) have mainly focused on static RL, where the loss distribution is assumed to be fixed during the learning process, and thus do not consider practical scenarios where the loss distribution could be non-stationary or even adversarial. While adversarial RL better models these practical scenarios, an open problem remains: how to develop a provably efficient algorithm for adversarial RL with switching costs? This paper makes the first effort towards solving this problem. First, we provide a regret lower bound showing that the regret of any algorithm must be $\tilde{\Omega}( ( H S A )^{1/3} T^{2/3} )$, where $T$, $S$, $A$ and $H$ are the number of episodes, states, actions and layers in each episode, respectively. Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best regret achieved in static RL with switching costs (as well as in adversarial RL without switching costs), whose dependency on $T$ is $\tilde{O}(\sqrt{T})$, is no longer achievable. Moreover, we propose two novel switching-reduced algorithms whose regrets match our lower bound when the transition function is known, and match it within a small factor of $\tilde{O}( H^{1/3} )$ when the transition function is unknown. Our regret analysis demonstrates that both algorithms are near-optimal.
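To make the gap between the two rates concrete, the following Python sketch evaluates the stated lower-bound rate $(HSA)^{1/3} T^{2/3}$ against the $\sqrt{T}$ rate, and counts switching costs for a sequence of policies. It is only an illustration under stated assumptions: the function names are hypothetical, constants and logarithmic factors hidden by the tilde notation are dropped, and the switching cost is assumed to be $\beta$ times the number of episodes in which the policy differs from the previous one, a form the abstract does not spell out.

```python
from typing import Hashable, Sequence


def switching_cost(policies: Sequence[Hashable], beta: float) -> float:
    """Assumed form: beta times the number of episode-to-episode policy changes."""
    switches = sum(1 for prev, cur in zip(policies, policies[1:]) if prev != cur)
    return beta * switches


def lower_bound_rate(H: int, S: int, A: int, T: int) -> float:
    """(H*S*A)^(1/3) * T^(2/3), with constants and log factors dropped."""
    return (H * S * A) ** (1 / 3) * T ** (2 / 3)


def sqrt_rate(T: int) -> float:
    """The sqrt(T) dependency of the static or switching-free setting."""
    return T ** 0.5


if __name__ == "__main__":
    # The ratio T^(2/3) / T^(1/2) = T^(1/6) grows without bound as T increases.
    for T in (10**2, 10**4, 10**6):
        ratio = lower_bound_rate(1, 1, 1, T) / sqrt_rate(T)
        print(f"T={T}: T^(2/3) / sqrt(T) = {ratio:.2f}")
```

Because the ratio of the two rates is $T^{1/6}$, no fixed constant can close the gap, which is the sense in which a $\sqrt{T}$-type regret is no longer achievable in this setting.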