Reinforcement learning (RL) has traditionally been understood from an episodic perspective; the concept of non-episodic RL, where there is no restart and therefore no reliable recovery, remains elusive. A fundamental question in non-episodic RL is how to measure the performance of a learner and derive algorithms that maximize such performance. Conventional wisdom is to maximize the difference between the average reward received by the learner and the maximal long-term average reward. In this paper, we argue that if the total time budget is relatively limited compared to the complexity of the environment, such a comparison may fail to reflect the finite-time optimality of the learner. We propose a family of measures, called $\gamma$-regret, which we believe better capture finite-time optimality. We provide motivation for these measures and derive lower and upper bounds for them. Note: A follow-up work (arXiv:2010.00587) has improved both our lower and upper bounds; the gap is now closed at $\tilde{\Theta}\left(\frac{\sqrt{SAT}}{(1 - \gamma)^{\frac{1}{2}}}\right)$.
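As a point of reference, a sketch of one natural instantiation of a $\gamma$-discounted regret (not necessarily the exact definition used in the paper) is the cumulative gap, over the $T$ steps of interaction, between the optimal $\gamma$-discounted value and the $\gamma$-discounted value of the learner's current policy at the visited states:

$$\mathrm{Regret}_\gamma(T) \;=\; \sum_{t=1}^{T}\Big(V^{*}_{\gamma}(s_t) - V^{\pi_t}_{\gamma}(s_t)\Big),$$

where $s_t$ is the state encountered at step $t$, $\pi_t$ is the policy the learner follows at step $t$, and $V^{*}_{\gamma}$ is the optimal $\gamma$-discounted value function; in the bound above, $S$ and $A$ denote the numbers of states and actions, and $T$ the total number of time steps.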