Text detoxification has the potential to mitigate the harms of toxicity by rephrasing text to remove offensive meaning, but subtle toxicity remains challenging to tackle. We introduce MaRCo, a detoxification algorithm that combines controllable generation and text rewriting methods using a Product of Experts with autoencoder language models (LMs). MaRCo uses likelihoods under a non-toxic LM (expert) and a toxic LM (anti-expert) to find candidate words to mask and potentially replace. We evaluate our method on several subtle toxicity and microaggressions datasets, and show that it not only outperforms baselines on automatic metrics, but that its rewrites are also preferred 2.1$\times$ more often in human evaluation. Its applicability to instances of subtle toxicity is especially promising, demonstrating a path forward for addressing increasingly elusive online hate.
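The mask-and-replace idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the method's actual implementation: the three toy word-probability tables (`EXPERT`, `ANTI_EXPERT`, `BASE`) stand in for context-dependent autoencoder LMs, the log-likelihood-ratio threshold is one possible masking criterion, and the weight `alpha` and all function names are hypothetical.

```python
import math

# Hypothetical toy "LMs": fixed per-word probabilities standing in for
# real, context-dependent autoencoder language models.
EXPERT = {"you": 0.30, "are": 0.30, "wrong": 0.25, "mistaken": 0.14, "stupid": 0.01}
ANTI_EXPERT = {"you": 0.25, "are": 0.25, "wrong": 0.10, "mistaken": 0.02, "stupid": 0.38}
BASE = {"you": 0.30, "are": 0.30, "wrong": 0.15, "mistaken": 0.10, "stupid": 0.15}


def find_mask_positions(tokens, threshold=1.0):
    """Return indices of tokens the anti-expert finds far more likely than the expert.

    The log-likelihood-ratio test and the threshold are assumptions of this sketch.
    """
    return [
        i
        for i, tok in enumerate(tokens)
        if math.log(ANTI_EXPERT[tok]) - math.log(EXPERT[tok]) > threshold
    ]


def poe_replacement(alpha=1.0):
    """Pick the vocabulary word with the highest Product-of-Experts score.

    In log space the product becomes a sum: the base LM score is boosted by
    the expert and penalized by the anti-expert, weighted by alpha.
    """
    def score(word):
        return math.log(BASE[word]) + alpha * (
            math.log(EXPERT[word]) - math.log(ANTI_EXPERT[word])
        )

    return max(BASE, key=score)


def detoxify(tokens, threshold=1.0, alpha=1.0):
    """Mask flagged tokens and fill each mask with the Product-of-Experts choice."""
    masked = set(find_mask_positions(tokens, threshold))
    return [poe_replacement(alpha) if i in masked else tok for i, tok in enumerate(tokens)]


print(detoxify(["you", "are", "stupid"]))
```

In this toy setup only the word that the anti-expert strongly prefers is masked, and the rest of the sentence is left untouched, which mirrors the goal of rewriting with minimal edits. A real system would score each position in context rather than from a fixed table.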