We study the problem of continually training an instruction-following agent through feedback provided by users during collaborative interactions. During interaction, human users instruct the agent in natural language and provide real-time binary feedback as they observe its instruction execution. We cast learning as a contextual bandit problem, converting the user feedback into immediate rewards. We evaluate through multiple rounds of human-agent interactions, demonstrating a 15.4% absolute improvement in instruction execution accuracy over time. We also show that our approach is robust to several design variations, and that the feedback signal is roughly equivalent in value to the learning signal of supervised demonstration data.
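To illustrate the contextual bandit framing, the sketch below shows one way binary user feedback could be mapped to an immediate reward and used in a REINFORCE-style policy update. This is an assumed minimal example, not the paper's implementation: the toy `BanditPolicy` class, its discrete action set, and the ±1 reward mapping are illustrative choices.

```python
# Minimal sketch (assumption, not the paper's system) of casting binary
# user feedback as immediate reward in a contextual bandit update.
import math
import random


def feedback_to_reward(feedback):
    """Map binary user feedback to an immediate scalar reward."""
    return 1.0 if feedback == "positive" else -1.0


class BanditPolicy:
    """Toy softmax policy over a small discrete action set."""

    def __init__(self, actions, lr=0.1):
        self.actions = actions
        self.lr = lr
        self.logits = {a: 0.0 for a in actions}

    def probs(self):
        # Softmax over action logits.
        z = sum(math.exp(v) for v in self.logits.values())
        return {a: math.exp(v) / z for a, v in self.logits.items()}

    def sample(self):
        # Draw an action from the current policy distribution.
        p = self.probs()
        r, acc = random.random(), 0.0
        for a in self.actions:
            acc += p[a]
            if r <= acc:
                return a
        return self.actions[-1]

    def update(self, action, reward):
        # REINFORCE-style bandit update: shift probability mass toward
        # (or away from) the taken action in proportion to the reward.
        p = self.probs()
        for a in self.actions:
            grad = (1.0 if a == action else 0.0) - p[a]
            self.logits[a] += self.lr * reward * grad
```

In use, the agent samples an action for the current instruction, the user's observed reaction is converted to a reward, and the policy is updated immediately, with no credit assignment across a long trajectory.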