The Teacher-Student Framework (TSF) is a reinforcement learning setting in which a teacher agent safeguards the training of a student agent by intervening and providing online demonstrations. The teacher policy is typically assumed to be optimal, with perfect timing and capability to intervene in the student agent's learning process, providing safety guarantees and exploration guidance. Nevertheless, in many real-world settings it is expensive or even impossible to obtain a well-performing teacher policy. In this work, we relax the assumption of a well-performing teacher and develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance. We instantiate an off-policy reinforcement learning algorithm, termed Teacher-Student Shared Control (TS2C), which incorporates teacher intervention based on trajectory-based value estimation. Theoretical analysis shows that the proposed TS2C algorithm attains efficient exploration and a substantial safety guarantee regardless of the teacher's own performance. Experiments on various continuous control tasks show that our method can exploit teacher policies at different performance levels while maintaining a low training cost. Moreover, the student policy surpasses the imperfect teacher policy, achieving higher accumulated reward in held-out testing environments. Code is available at https://metadriverse.github.io/TS2C.
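The intervention mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the function names (`should_intervene`, `shared_control_step`, `value_fn`) and the simple threshold rule are assumptions; TS2C's real criterion relies on trajectory-based value estimates learned during off-policy training.

```python
# Hypothetical sketch of value-based teacher intervention in a
# shared-control loop. The teacher takes over whenever the estimated
# value of the student's proposed action falls more than `eps` below
# that of the teacher's proposed action.

def should_intervene(v_teacher_action, v_student_action, eps=0.1):
    """Return True if the teacher should override the student."""
    return v_student_action < v_teacher_action - eps

def shared_control_step(env, state, student_act, teacher_act,
                        value_fn, eps=0.1):
    """One environment step under teacher-student shared control."""
    a_student = student_act(state)
    a_teacher = teacher_act(state)
    if should_intervene(value_fn(state, a_teacher),
                        value_fn(state, a_student), eps):
        action = a_teacher  # teacher intervenes for safety/guidance
    else:
        action = a_student  # student keeps control and explores
    return env.step(action)
```

Note that under this rule an imperfect teacher only intervenes when its own action is estimated to be clearly better, which is consistent with the goal of exploiting teachers at different performance levels.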