Sophisticated language models such as OpenAI's GPT-3 can generate hateful text that targets marginalized groups. Given this capacity, we ask whether large language models can also be used to identify hate speech and to classify text as sexist or racist. We use GPT-3 to identify sexist and racist text passages with zero-, one-, and few-shot learning. We find that with zero- and one-shot learning, GPT-3 identifies sexist or racist text with an accuracy between 48 and 69 per cent. With few-shot learning and an instruction included in the prompt, the model's accuracy can be as high as 78 per cent. We conclude that large language models have a role to play in hate speech detection, and that with further development they could be used to counter hate speech and even to self-police.
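To make the setup concrete, the following is a minimal sketch of few-shot classification with an instruction included in the prompt, assuming the legacy OpenAI Completions API (`openai.Completion.create`) that served GPT-3 models such as `text-davinci-002`; the instruction wording and the few-shot examples are illustrative placeholders, not the paper's exact prompts or data.

```python
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Illustrative instruction; the paper's exact wording may differ.
INSTRUCTION = "Classify the passage as 'sexist', 'racist', or 'neither'."

# Hypothetical few-shot examples (placeholders, not from the paper's data).
FEW_SHOT = (
    "Passage: Women don't belong in engineering.\nLabel: sexist\n\n"
    "Passage: The weather is lovely today.\nLabel: neither\n\n"
)

def classify(passage: str) -> str:
    """Return GPT-3's label for one passage."""
    prompt = f"{INSTRUCTION}\n\n{FEW_SHOT}Passage: {passage}\nLabel:"
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        max_tokens=5,     # the label is a single short word
        temperature=0.0,  # deterministic output for classification
    )
    return response["choices"][0]["text"].strip()

if __name__ == "__main__":
    print(classify("An example passage to classify."))
```

Dropping `FEW_SHOT` from the prompt yields the zero-shot condition, and keeping a single example yields the one-shot condition, which is how the three settings compared in the abstract differ.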