We introduce a theory of reinforcement learning (RL) in which the learner receives feedback only once at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of real-world applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many real-world applications of reinforcement learning, such as self-driving cars and robotics, it is easier to evaluate whether a learner's complete trajectory was either "good" or "bad," but harder to provide a reward signal at each step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model, and provide a statistically and computationally efficient algorithm that achieves sub-linear regret.
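As a concrete illustration of the setting described above (a minimal sketch under our own assumptions; the abstract itself does not fix a particular model), the once-per-episode label could be generated by a logistic model over trajectory features:
\[
y \;\sim\; \mathrm{Bernoulli}\Bigl(\sigma\bigl(\bigl\langle \theta^{\star},\, \textstyle\sum_{h=1}^{H} \phi(s_h, a_h)\bigr\rangle\bigr)\Bigr),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}},
\]
where $H$ is the episode length, $\phi$ maps state-action pairs to features, and $\theta^{\star}$ is the unknown parameter governing the feedback. The symbols $y$, $\phi$, $H$, and $\theta^{\star}$ are our notation, introduced only to make the "unknown parametric model" concrete.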