Harmful content is pervasive on social media, poisoning online communities and negatively impacting participation. A common approach to addressing this issue is to develop detection models that rely on human annotations. However, the annotation tasks required to build such models expose annotators to harmful and offensive content and can be costly and time-consuming to complete. Generative AI models have the potential to understand and detect harmful content. To investigate this potential, we used ChatGPT and compared its performance with MTurker annotations for three frequently discussed concepts related to harmful content: Hateful, Offensive, and Toxic (HOT). We designed five prompts to interact with ChatGPT and conducted four experiments eliciting HOT classifications. Our results show that ChatGPT can achieve an accuracy of approximately 80% when compared to MTurker annotations. Specifically, the model's classifications agree with human annotations more consistently for non-HOT comments than for HOT comments. Our findings also suggest that ChatGPT's classifications align with the provided HOT definitions, but that it treats "hateful" and "offensive" as subsets of "toxic." Moreover, the choice of prompts used to interact with ChatGPT affects its performance. Based on these insights, our study offers several meaningful implications for employing ChatGPT to detect HOT content, particularly regarding the reliability and consistency of its performance, its understanding and reasoning about the HOT concepts, and the impact of prompts on its performance. Overall, our study provides guidance on the potential of using generative AI models to moderate large volumes of user-generated content on social media.
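To make the prompt-based classification workflow concrete, the sketch below shows one plausible way to assemble a HOT classification prompt and parse a model's reply into per-label judgments. The definitions, prompt wording, and reply format here are illustrative assumptions, not the paper's actual five prompts.

```python
# Hypothetical sketch of prompt construction and reply parsing for
# HOT (Hateful, Offensive, Toxic) classification. The definitions and
# the expected reply format are assumptions for illustration only.

HOT_DEFINITIONS = {
    "hateful": "abusive speech attacking a group based on protected characteristics",
    "offensive": "language that is rude, disrespectful, or profane",
    "toxic": "a comment likely to make others disengage from the discussion",
}

def build_prompt(comment: str) -> str:
    """Assemble a zero-shot classification prompt for one comment."""
    defs = "\n".join(f"- {k}: {v}" for k, v in HOT_DEFINITIONS.items())
    return (
        "Using these definitions:\n"
        f"{defs}\n"
        "Answer yes or no for each label (hateful, offensive, toxic) "
        f"for the comment below.\nComment: {comment}"
    )

def parse_reply(reply: str) -> dict:
    """Extract yes/no answers per label from a reply such as
    'hateful: no, offensive: yes, toxic: yes'."""
    labels = {}
    for part in reply.lower().replace(",", "\n").splitlines():
        if ":" in part:
            key, _, val = part.partition(":")
            key, val = key.strip(), val.strip()
            if key in HOT_DEFINITIONS:
                labels[key] = val.startswith("yes")
    return labels
```

The prompt string produced by `build_prompt` would be sent to a chat model, and `parse_reply` converts the free-text answer into labels that can be compared directly against MTurker annotations for agreement scoring.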