We study a theory of reinforcement learning (RL) in which the learner receives binary feedback only once at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of real-world applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many real-world applications of reinforcement learning, such as self-driving cars and robotics, it is easier to evaluate whether a learner's complete trajectory was "good" or "bad," but harder to provide a reward signal at each step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model, and provide a statistically and computationally efficient algorithm that achieves sublinear regret.