Transformer-based language models generate fluent text and can be efficiently adapted to a wide range of natural language generation tasks. However, language models pretrained on large unlabeled web text corpora have been shown to degenerate into toxic content and to exhibit social biases, which hinders their safe deployment. Various detoxification methods have been proposed to mitigate language model toxicity; however, these methods struggle to detoxify language models when conditioned on prompts that mention specific social identities related to gender, race, or religion. In this study, we propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models. We address the challenge of safety in language models and propose a new reward model that can detect toxic content while mitigating unintended bias towards social identities in toxicity prediction. Experiments demonstrate that Reinforce-Detoxify outperforms existing detoxification approaches on automatic evaluation metrics, indicating that our approach both detoxifies the language model and is less prone to unintended bias toward social identities in the generated content.
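To make the overall idea concrete, the following is a minimal, self-contained sketch of reinforcement learning-based detoxification in the spirit described above: sample continuations from a policy language model, score them with a reward model that penalizes toxic output, and apply a policy-gradient (REINFORCE-style) update. The ToyPolicyLM class and toxicity_reward function are hypothetical stand-ins introduced only for illustration; they do not reflect the paper's actual model architecture, reward model, or training algorithm.

```python
# Minimal sketch of REINFORCE-style language model detoxification.
# Assumptions: ToyPolicyLM replaces the pretrained Transformer LM, and
# toxicity_reward replaces the paper's learned, bias-aware reward model.

import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM, HID_DIM, MAX_LEN = 100, 32, 64, 12


class ToyPolicyLM(nn.Module):
    """Tiny autoregressive policy standing in for a Transformer LM."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.rnn = nn.GRU(EMB_DIM, HID_DIM, batch_first=True)
        self.head = nn.Linear(HID_DIM, VOCAB_SIZE)

    def step(self, token, hidden):
        # One decoding step: next-token logits and updated hidden state.
        emb = self.embed(token).unsqueeze(1)
        out, hidden = self.rnn(emb, hidden)
        return self.head(out.squeeze(1)), hidden


def toxicity_reward(token_ids):
    """Stub reward: penalize tokens from a made-up 'toxic' id range.
    In the paper this is a learned classifier designed to also reduce
    unintended bias toward social identities."""
    toxic_fraction = (token_ids >= 90).float().mean(dim=1)
    return 1.0 - toxic_fraction  # higher reward for less 'toxic' continuations


def reinforce_step(policy, optimizer, prompts):
    """Sample continuations, score them with the reward model, and apply a
    REINFORCE policy-gradient update with a mean-reward baseline."""
    hidden = None
    token = prompts[:, -1]  # toy conditioning: only the last prompt token
    log_probs, sampled = [], []

    for _ in range(MAX_LEN):
        logits, hidden = policy.step(token, hidden)
        dist = torch.distributions.Categorical(logits=logits)
        token = dist.sample()
        log_probs.append(dist.log_prob(token))
        sampled.append(token)

    sampled = torch.stack(sampled, dim=1)      # (batch, MAX_LEN)
    log_probs = torch.stack(log_probs, dim=1)  # (batch, MAX_LEN)
    reward = toxicity_reward(sampled)          # (batch,)
    advantage = reward - reward.mean()         # baseline-subtracted reward

    loss = -(advantage.unsqueeze(1) * log_probs).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()


if __name__ == "__main__":
    torch.manual_seed(0)
    policy = ToyPolicyLM()
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
    prompts = torch.randint(0, VOCAB_SIZE, (8, 4))  # random toy prompts
    for epoch in range(5):
        print(f"epoch {epoch}: mean reward {reinforce_step(policy, optimizer, prompts):.3f}")
```

In practice, RL-based detoxification methods typically add a constraint (e.g., a KL penalty toward the original pretrained model) so that reward optimization does not degrade fluency; the sketch omits this for brevity.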