Using Machine Teaching to Investigate Human Assumptions when Teaching Reinforcement Learners

Successful teaching requires an assumption of how the learner learns: how the learner uses experiences from the world to update its internal states. We investigate what expectations people have about a learner when they teach it in an online manner using rewards and punishments. We focus on a common reinforcement learning method, Q-learning, and use a behavioral experiment to examine what assumptions people hold. To do so, we first establish a normative standard by formulating the problem as a machine teaching optimization problem. To solve this optimization problem, we use a deep learning approximation method that simulates learners in the environment and learns to predict how feedback affects the learner's internal states. What do people assume about a learner's learning and discount rates when they teach it an idealized exploration-exploitation task? In a behavioral experiment, we find that people can teach the task to Q-learners relatively efficiently and effectively when the learner uses a small value for its discount rate and a large value for its learning rate. However, their teaching is still suboptimal. We also find that providing people with real-time updates of how possible feedback would affect the Q-learner's internal states helps them teach only weakly. Our results reveal how people teach using evaluative feedback and provide guidance for how engineers should design machine agents in a manner that is intuitive for people.
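For readers unfamiliar with the two parameters discussed above, the following is a minimal sketch of the standard tabular Q-learning update, with the teacher's evaluative feedback used as the reward signal. It is not the authors' code: the function name `q_update`, the toy three-step episode, and the values `alpha=0.9` (a large learning rate) and `gamma=0.1` (a small discount rate) are assumptions chosen only for illustration.

```python
from collections import defaultdict


def q_update(q, state, action, feedback, next_state, actions, alpha, gamma):
    """Apply one tabular Q-learning step; the teacher's feedback is the reward."""
    # Value of the best action available in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Temporal-difference target: immediate feedback plus discounted future value.
    td_target = feedback + gamma * best_next
    # Move the current estimate toward the target by a fraction alpha.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


actions = ["left", "right"]
q = defaultdict(float)  # all Q-values start at 0.0

# Hypothetical teaching episode: (state, action, teacher feedback, next state).
episode = [(0, "right", 1.0, 1), (1, "right", 1.0, 2), (2, "left", -1.0, 1)]
for s, a, f, s2 in episode:
    q_update(q, s, a, f, s2, actions, alpha=0.9, gamma=0.1)
```

With a large `alpha`, a single piece of feedback moves the Q-value most of the way to its target, and with a small `gamma` the feedback mostly affects the action it was given for rather than propagating back to earlier states.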
