Language models can learn a range of capabilities from unsupervised training on text corpora. However, to solve a particular problem (such as text summarization) it is typically necessary to fine-tune them on a task-specific dataset. It is often easier for humans to choose between options than to provide labeled data, and prior work has achieved state-of-the-art performance by training a reward model from such preference comparisons. However, collecting a large preference comparison dataset is still expensive -- and the learned reward models are unreliable out-of-distribution. We seek to address these problems via uncertainty estimation, which can improve sample efficiency and robustness using active learning and risk-averse reinforcement learning (RL). Specifically, we use bootstrap aggregating (bagging) to train an ensemble of reward models differing in the initialization of their final layer. Ensembles have proved successful in prior applications of active learning, but we find that in our setting ensemble active learning does not outperform random sampling. Further experiments show that while the aggregate predictions are well-calibrated, the ensemble's estimated epistemic uncertainty is only weakly correlated with model error. We suspect this is because the ensemble members are fine-tuned from a single model and so are similar to one another. This suggests current pre-training methods will need to be modified to support uncertainty estimation, e.g. by training multiple language models.
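As a concrete illustration of the setup described above, the following is a minimal sketch (not the paper's actual implementation) of a bagged reward-model ensemble in PyTorch. Each member shares a pretrained encoder, assumed fixed here and represented by a hypothetical `embed_fn`, differs only in the random initialization of its final scalar reward head, and is trained with the standard preference-comparison loss on a bootstrap resample of the data. The ensemble mean gives the aggregate reward prediction and the disagreement between members gives the epistemic-uncertainty estimate.

```python
# Illustrative sketch only; names (embed_fn, RewardHead, etc.) are assumptions,
# and the shared pretrained body is treated as frozen for simplicity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardHead(nn.Module):
    """Scalar reward head on top of a shared text embedding."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Independently initialized per ensemble member: this is the only
        # source of diversity besides the bootstrap resampling below.
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.linear(embedding).squeeze(-1)  # (batch,) scalar rewards


def preference_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: -log P(preferred beats rejected)."""
    return -F.logsigmoid(r_preferred - r_rejected).mean()


def train_bagged_ensemble(embed_fn, comparisons, hidden_dim=768,
                          n_members=5, epochs=3, lr=1e-3):
    """embed_fn: maps a list of texts to (batch, hidden_dim) embeddings from the
    shared pretrained model. comparisons: list of (preferred, rejected) text pairs.
    Returns a list of trained RewardHead modules."""
    n = len(comparisons)
    members = []
    for _ in range(n_members):
        head = RewardHead(hidden_dim)
        opt = torch.optim.Adam(head.parameters(), lr=lr)
        # Bootstrap aggregating: each member sees n comparisons drawn with replacement.
        idx = torch.randint(0, n, (n,)).tolist()
        for _ in range(epochs):
            for i in idx:
                preferred, rejected = comparisons[i]
                loss = preference_loss(head(embed_fn([preferred])),
                                       head(embed_fn([rejected])))
                opt.zero_grad()
                loss.backward()
                opt.step()
        members.append(head)
    return members


def ensemble_reward(members, embed_fn, text):
    """Mean reward = aggregate prediction; std across members = uncertainty estimate."""
    with torch.no_grad():
        rewards = torch.stack([head(embed_fn([text])) for head in members])
    return rewards.mean(), rewards.std()
```

Under this kind of setup, the per-member standard deviation is what would be fed to an active-learning acquisition rule or a risk-averse RL objective; the abstract's finding is that this disagreement signal correlates only weakly with actual model error when all members descend from one pretrained model.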