We study the sample complexity of teaching, termed "teaching dimension" (TDim) in the literature, for the teaching-by-reinforcement paradigm, where the teacher guides the student through rewards. This is distinct from the teaching-by-demonstration paradigm motivated by robotics applications, where the teacher teaches by providing demonstrations of state/action trajectories. The teaching-by-reinforcement paradigm applies to a wider range of real-world settings where demonstrations are inconvenient, but it has not been studied systematically. In this paper, we focus on Q-learning, a specific family of reinforcement learning algorithms, characterize the TDim under teachers with varying control power over the environment, and present matching optimal teaching algorithms. Our TDim results provide the minimum number of samples needed for reinforcement learning, and we discuss their connections to standard PAC-style RL sample complexity and to teaching-by-demonstration sample complexity results. Our teaching algorithms have the potential to speed up RL agent learning in applications where a helpful teacher is available.
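To make the teaching-by-reinforcement setting concrete, the following is a minimal sketch (not the paper's construction) of a teacher steering a tabular Q-learner toward a target policy by choosing rewards. The 2-state/2-action environment, learning rate, target policy, and the teacher's reward rule are all illustrative assumptions; the number of experiences the loop consumes plays the role of a (non-optimized) teaching cost.

```python
# Minimal sketch: teaching-by-reinforcement on a tabular Q-learner.
# All quantities below (the hypothetical 2-state/2-action MDP, learning rate,
# target policy, and the teacher's reward rule) are illustrative assumptions.
import numpy as np

n_states, n_actions = 2, 2
alpha, gamma = 0.5, 0.9            # learning rate and discount factor
Q = np.zeros((n_states, n_actions))
target_policy = {0: 1, 1: 0}       # action the teacher wants to be greedy in each state

def q_update(s, a, r, s_next):
    """Standard Q-learning update on the experience (s, a, r, s_next)."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def teacher_reward(s, a):
    """Teacher controlling rewards: reward the target action, penalize others."""
    return 10.0 if a == target_policy[s] else -10.0

# Teaching loop: the student explores; the teacher shapes rewards until the
# greedy policy induced by Q matches the target policy in every state.
rng = np.random.default_rng(0)
steps = 0
while any(Q[s].argmax() != target_policy[s] for s in range(n_states)):
    s = rng.integers(n_states)
    a = rng.integers(n_actions)    # student's (here: uniformly random) exploration
    s_next = rng.integers(n_states)
    q_update(s, a, teacher_reward(s, a), s_next)
    steps += 1

print(f"Target policy taught after {steps} experiences; Q =\n{Q}")
```

A more powerful teacher in this sketch could also control which state/action the student experiences next, which is exactly the kind of "control power over the environment" that the different TDim regimes vary.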