We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
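To make the final step concrete, the following is a minimal Python sketch of the rejection-sampling (best-of-n) procedure the abstract describes: draw several candidate answers from the behavior-cloned policy and keep the one the reward model scores highest. The sampling function, reward function, and all names below are hypothetical stand-ins for illustration, not the paper's implementation.

```python
import random
from typing import Callable

def best_of_n(
    question: str,
    sample_answer: Callable[[str], str],  # hypothetical: draws one answer from the BC policy
    reward: Callable[[str, str], float],  # hypothetical: reward-model score for (question, answer)
    n: int = 16,
) -> str:
    """Rejection sampling (best-of-n): sample n candidate answers from the
    behavior-cloned policy, then return the candidate the reward model
    trained on human preferences scores highest."""
    candidates = [sample_answer(question) for _ in range(n)]
    return max(candidates, key=lambda ans: reward(question, ans))

# Toy stand-ins so the sketch runs end to end; in the paper, both the policy
# and the reward model are fine-tuned GPT-3 models, which are not shown here.
if __name__ == "__main__":
    pool = ["short answer", "a longer answer with supporting references", "off-topic reply"]
    pick = best_of_n(
        "Why is the sky blue?",
        sample_answer=lambda q: random.choice(pool),
        reward=lambda q, a: float(len(a)),  # placeholder heuristic, not a real reward model
        n=8,
    )
    print(pick)
```

Note that n trades compute for answer quality: more samples give the reward model more candidates to choose from, at the cost of n forward passes per question.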