Frozen models trained to mimic static datasets can never improve their performance. Models that can employ internet retrieval for up-to-date information and obtain feedback from humans during deployment hold the promise of both adapting to new information and improving their performance. In this work we study how to improve internet-driven conversational skills in such a learning framework. We collect deployment data of human interactions, which we make publicly available, together with various types of human feedback, including binary quality measurements, free-form text feedback, and fine-grained reasons for failure. We then study various algorithms for learning from such feedback, including standard supervised learning, rejection sampling, model guiding, and reward-based learning, in order to make recommendations on which types of feedback and algorithms work best. We find that the recently introduced Director model (Arora et al., '22) shows significant improvements over other existing approaches.