Artificial writing is permeating our lives due to recent advances in large-scale, transformer-based language models (LMs) such as BERT, its variants, GPT-2/3, and others. Using them as pre-trained models and fine-tuning them for specific tasks, researchers have extended the state of the art for many NLP tasks and shown that they capture not only linguistic knowledge but also retain general knowledge implicitly present in the data. Unfortunately, LMs trained on unfiltered text corpora suffer from degenerate and biased behaviour. While this is well established, we show that recent LMs also contain human-like biases about what is right and wrong to do, a form of the ethical and moral norms of society -- they bring a "moral direction" to the surface. That is, we show that these norms can be captured geometrically by a direction in the embedding space, computed, e.g., by PCA, that reflects well how strongly phrases agree with the social norms implicitly expressed in the training texts and provides a path for attenuating or even preventing toxic degeneration in LMs. Being able to rate the (non-)normativity of arbitrary phrases without explicitly training the LM for this task, we demonstrate the capabilities of the "moral direction" for guiding (even other) LMs towards producing normative text and showcase it on the RealToxicityPrompts testbed, preventing neural toxic degeneration in GPT-2.
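To make the geometric idea concrete, the following is a minimal sketch of how such a direction could be estimated with off-the-shelf tools: sentence embeddings of a handful of normative and non-normative phrases are reduced with PCA, and the first principal component serves as the "moral direction" onto which new phrases are projected. The encoder name, the phrase list, and the scoring helper are illustrative assumptions for this sketch, not the paper's exact setup.

```python
# Minimal sketch of a PCA-based "moral direction", assuming a generic
# sentence-embedding model (sentence-transformers) and an illustrative,
# hand-picked phrase list rather than the paper's curated data.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

model = SentenceTransformer("bert-large-nli-mean-tokens")  # any sentence encoder works

# Phrases describing actions widely seen as positive vs. negative
# (hypothetical examples).
phrases = [
    "help people", "be friendly", "tell the truth",
    "harm people", "steal money", "lie to others",
]
embeddings = model.encode(phrases)

# The "moral direction" is taken as the first principal component
# of these phrase embeddings.
pca = PCA(n_components=1)
pca.fit(embeddings)
direction = pca.components_[0]

def normativity_score(phrase: str) -> float:
    """Project a phrase's embedding onto the moral direction (sign and scale untuned)."""
    emb = model.encode([phrase])[0]
    return float(np.dot(emb - pca.mean_, direction))

print(normativity_score("greet my neighbours"))
print(normativity_score("hurt my neighbours"))
```

In this sketch the sign of the principal component is arbitrary, so in practice it would be oriented (and the scores normalised) against phrases of known polarity before being used, e.g., to filter or steer generation.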