In reinforcement learning, the classic objectives of maximizing discounted and finite-horizon cumulative rewards are PAC-learnable: There are algorithms that learn a near-optimal policy with high probability using a finite amount of samples and computation. In recent years, researchers have introduced objectives and corresponding reinforcement-learning algorithms beyond the classic cumulative rewards, such as objectives specified as linear temporal logic formulas. However, questions about the PAC-learnability of these new objectives have remained open. This work demonstrates the PAC-learnability of general reinforcement-learning objectives through sufficient conditions for PAC-learnability in two analysis settings. In particular, for the analysis that considers only sample complexity, we prove that if an objective given as an oracle is uniformly continuous, then it is PAC-learnable. Further, for the analysis that considers computational complexity, we prove that if an objective is computable, then it is PAC-learnable. In other words, if a procedure computes successive approximations of the objective's value, then the objective is PAC-learnable. We give three applications of our condition on objectives from the literature with previously unknown PAC-learnability and prove that these objectives are PAC-learnable. Overall, our result helps verify existing objectives' PAC-learnability. Also, as some studied objectives that are not uniformly continuous have been shown to be not PAC-learnable, our results could guide the design of new PAC-learnable objectives.
翻译:在强化学习中,经典的目标是最大化折扣和有限时间段内的累积奖励,这些目标是可通过.PAC方法进行学习的:存在的算法能够使用有限的样本和计算高概率地学习出一种近似最优的策略。近年来,研究人员提出了一些超越经典累积奖励的目标和相应的强化学习算法,如用线性时间逻辑公式表示的目标。然而,对于这些新目标的可PAC-学习性质的问题一直没有解答。本文通过分别分析样本复杂度和计算复杂度的充分条件,从而证明了一般的强化学习目标是可PAC-学习的。特别是,在只考虑样本复杂度的分析中,我们证明如果一个目标作为oracle是一致连续的,那么它就是可PAC-学习的。此外,在考虑计算复杂度的分析中,我们证明如果一个目标是可计算的,那么它就是可PAC-学习的。也就是说,如果一个程序计算目标价值的连续逼近值,那么这个目标是可PAC-学习的。我们使用我们的条件证明了三个文献中以前未知的PAC-可学习目标的应用。总之,我们的结果有助于验证现有的目标的可PAC-学习性质。由于已知不是一致连续的一些研究目标已被证明不是可PAC-学习的,我们的结果可以指导新的可PAC-学习目标的设计。