Transformer-based Language Models (LMs) achieve remarkable performance on a variety of NLU tasks, but are also prone to generating toxic texts such as insults, threats, and profanities, which limits their adoption in real-world applications. To overcome this issue, a few text generation approaches aim to detoxify toxic texts with additional LMs or perturbations. However, previous methods require excessive memory, computation, and time, which are serious bottlenecks for real-world applications. To address such limitations, we propose an effective yet efficient method for language detoxification using an attribute-discriminative latent space. Specifically, we project the latent space of an original Transformer LM onto a discriminative latent space in which texts are well-separated by their attributes, with the help of a projection block and a discriminator. This allows the LM to control text generation to be non-toxic with minimal memory and computation overhead. We validate our model, the Attribute-Discriminative Language Model (ADLM), on detoxified language and dialogue generation tasks, on which our method significantly outperforms baselines in both performance and efficiency.
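To make the architectural idea concrete, the following is a minimal, hypothetical sketch of a projection block and attribute discriminator applied to a Transformer LM's hidden states. All module names, layer sizes, the choice of mean pooling, and the two-class (toxic / non-toxic) setup are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ProjectionBlock(nn.Module):
    """Maps the base LM's hidden states into a separate latent space
    intended to be discriminative with respect to text attributes."""
    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the frozen LM
        return self.proj(hidden_states)

class AttributeDiscriminator(nn.Module):
    """Classifies projected latents by attribute (e.g., toxic vs. non-toxic),
    so that training pushes the two attributes apart in the latent space."""
    def __init__(self, latent_dim: int, num_attributes: int = 2):
        super().__init__()
        self.classifier = nn.Linear(latent_dim, num_attributes)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # Mean-pool over the sequence dimension before classifying.
        return self.classifier(latents.mean(dim=1))

# Illustrative training step: a cross-entropy loss on the discriminator's
# output shapes the projected space to separate texts by attribute.
hidden = torch.randn(4, 16, 768)          # stand-in for LM hidden states
labels = torch.tensor([0, 1, 0, 1])       # 0 = non-toxic, 1 = toxic
proj, disc = ProjectionBlock(768, 256), AttributeDiscriminator(256)
loss = nn.functional.cross_entropy(disc(proj(hidden)), labels)
```

Because only the small projection block and discriminator are added on top of the original LM, the extra memory and computation at generation time stay minimal, which is the efficiency argument made in the abstract.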