Artificial writing is permeating our lives due to recent advances in large-scale, transformer-based language models (LMs) such as BERT, its variants, GPT-2/3, and others. By using them as pre-trained models and fine-tuning them for specific tasks, researchers have extended the state of the art for many NLP tasks and shown that they capture not only linguistic knowledge but also retain general knowledge implicitly present in the data. Unfortunately, LMs trained on unfiltered text corpora suffer from degenerate and biased behavior. While this is well established, we show that recent LMs also store ethical and moral norms of the society and actually bring a "moral direction" to the surface. In this study, we show that these norms can be captured geometrically by a direction in the embedding space, which can be computed, e.g., by PCA, and which reflects well the agreement of phrases with the social norms implicitly expressed in the training texts. Furthermore, this provides a path for attenuating or even preventing toxic degeneration in LMs. Being able to rate the (non-)normativity of arbitrary phrases without explicitly training the LM for this task, we demonstrate the capabilities of the moral direction for guiding (even other) LMs towards producing normative text, and we showcase it on the RealToxicityPrompts testbed, preventing neural toxic degeneration in GPT-2.
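The core geometric idea, a direction in embedding space computed via PCA that separates normative from non-normative phrases, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random vectors stand in for sentence embeddings (the actual work uses a transformer-based sentence encoder), and the group offsets, dimensions, and scoring function are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for sentence embeddings of phrases judged normative vs.
# non-normative. In practice these would come from a sentence encoder;
# here they are random vectors with an artificial group offset.
rng = np.random.default_rng(0)
dim = 16
positive = rng.normal(size=(5, dim)) + 1.0  # embeddings of normative phrases
negative = rng.normal(size=(5, dim)) - 1.0  # embeddings of non-normative phrases

# Stack both groups, mean-center, and take the top principal component
# (via SVD) as the candidate "moral direction".
X = np.vstack([positive, negative])
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
moral_direction = vt[0]

# Orient the direction so that normative phrases score positively.
if positive.mean(axis=0) @ moral_direction < 0:
    moral_direction = -moral_direction

def moral_score(embedding: np.ndarray) -> float:
    """Rate a phrase by projecting its embedding onto the moral direction."""
    return float(embedding @ moral_direction)
```

Once the direction is fixed, `moral_score` rates arbitrary phrase embeddings without any task-specific training, and such a score could be used, as the abstract describes, to steer generation away from non-normative continuations.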