Pretrained large language models have become indispensable for solving various natural language processing (NLP) tasks. However, safely deploying them in real-world applications is challenging because they can generate toxic content. To address this challenge, we propose two novel pretraining data augmentation strategies that significantly reduce model toxicity without compromising its utility. Our two strategies are: (1) MEDA, which adds the raw toxicity score as meta-data to the pretraining samples, and (2) INST, which adds instructions to those samples indicating their toxicity. Our results indicate that our best-performing strategy (INST) reduces the toxicity probability by up to 61% while preserving the accuracy on five benchmark NLP tasks, as well as improving AUC scores on four bias detection tasks by 1.3%. We also demonstrate the generalizability of our techniques by scaling the number of training samples and the number of model parameters.
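To make the two strategies concrete, the sketch below shows one plausible way to augment a pretraining sample. The tag format for MEDA, the instruction template for INST, the 0.5 threshold, and the `toxicity_score` helper are all illustrative assumptions, not the paper's exact implementation.

```python
def toxicity_score(text: str) -> float:
    """Hypothetical placeholder: in practice this would call an external
    toxicity classifier (e.g. an off-the-shelf scorer) on the sample."""
    raise NotImplementedError

def augment_meda(sample: str) -> str:
    """MEDA: prepend the raw toxicity score as meta-data (assumed tag format)."""
    score = toxicity_score(sample)
    return f"<toxicity={score:.2f}> {sample}"

def augment_inst(sample: str, threshold: float = 0.5) -> str:
    """INST: prepend a natural-language instruction indicating toxicity
    (assumed template and threshold)."""
    label = "toxic" if toxicity_score(sample) >= threshold else "non-toxic"
    return f"The following text is {label}: {sample}"
```

Either augmentation is applied to the raw pretraining corpus before training, so the model learns to associate the meta-data tag or instruction with the toxicity of the text that follows.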