Because of its subtlety, implicitness, and the differing interpretations perceived by different readers, detecting undesirable content in text is a nuanced challenge. It is a long-known risk that language models (LMs), once trained on corpora containing undesirable content, can manifest biases and toxicity. However, recent studies suggest that, as a remedy, LMs are also capable of identifying toxic content without additional fine-tuning. Prompt-based methods have been shown to effectively harvest this surprising self-diagnosing capability. However, existing prompt-based methods usually specify an instruction to a language model in a discriminative way. In this work, we explore the generative variant of zero-shot prompt-based toxicity detection with comprehensive trials on prompt engineering. We evaluate on three datasets with toxicity labels annotated on social media posts. Our analysis highlights the strengths of our generative classification approach both quantitatively and qualitatively. Interesting aspects of self-diagnosis and its ethical implications are discussed.
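To make the contrast concrete, below is a minimal sketch of what a generative zero-shot formulation can look like, assuming a GPT-2 backbone and illustrative prompt wordings (the paper's actual prompts and models may differ). Rather than asking the LM to emit a label token (the discriminative framing), it scores the post itself under two label-conditioned prompts and picks the label whose framing makes the post more likely, i.e. an argmax over P(post | label).

```python
# Sketch: generative zero-shot toxicity classification with a causal LM.
# Model choice and prompt wordings here are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_log_prob(prefix: str, continuation: str) -> float:
    """Sum of log-probabilities the LM assigns to `continuation` given `prefix`."""
    prefix_len = tokenizer(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prefix + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # Logits at position i predict token i+1, so score only the tokens
    # that come after the prefix.
    return sum(
        log_probs[0, pos - 1, full_ids[0, pos]].item()
        for pos in range(prefix_len, full_ids.shape[1])
    )

def classify(post: str) -> str:
    # Which label-conditioned framing makes the post more likely?
    toxic_score = continuation_log_prob("The following post is toxic:\n", post)
    benign_score = continuation_log_prob("The following post is not toxic:\n", post)
    return "toxic" if toxic_score > benign_score else "non-toxic"

print(classify("Have a great day, everyone!"))
```

A discriminative prompt would instead append a question such as "Is this post toxic?" and compare the probabilities of the answer tokens "Yes" and "No"; the generative variant sketched above reverses the conditioning direction.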