To proactively offer social media users a safe online experience, there is a need for systems that can detect harmful posts and promptly alert platform moderators. To ensure that a consistent policy is enforced, moderators are provided with detailed guidelines. In contrast, most state-of-the-art models learn what abuse is from labelled examples and, as a result, base their predictions on spurious cues, such as the presence of group identifiers, which can be unreliable. In this work we introduce the concept of policy-aware abuse detection, abandoning the unrealistic expectation that systems can reliably learn which phenomena constitute abuse from inspecting the data alone. We propose a machine-friendly representation of the policy that moderators wish to enforce, by breaking it down into a collection of intents and slots. We collect and annotate a dataset of 3,535 English posts with such slots, and show how architectures for intent classification and slot filling can be used for abuse detection, while providing a rationale for model decisions.
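As a rough illustration of what such a machine-friendly policy representation could look like, the sketch below encodes a post together with an intent and its filled slots. The intent label `dehumanising_comparison`, the slot names `target` and `comparison`, and the example post are all invented for this illustration; they are not the paper's actual policy taxonomy or dataset content.

```python
# Minimal sketch only: the intent and slot labels below are hypothetical
# placeholders, not the taxonomy actually defined in the paper.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Slot:
    name: str              # slot type, e.g. a hypothetical "target"
    span: Tuple[int, int]  # (start, end) character offsets into the post
    text: str              # surface form covered by the span


@dataclass
class PolicyAnnotation:
    post: str
    intent: str                          # which policy rule the post matches
    slots: List[Slot] = field(default_factory=list)


# Toy annotated example (post and labels invented for illustration):
post = "people like them are vermin"
annotation = PolicyAnnotation(
    post=post,
    intent="dehumanising_comparison",       # hypothetical intent label
    slots=[
        Slot("target", (0, 16), post[0:16]),        # "people like them"
        Slot("comparison", (21, 27), post[21:27]),  # "vermin"
    ],
)
print(annotation.intent, [s.text for s in annotation.slots])
```

Under this view, detecting abuse amounts to classifying the intent behind a post and filling the corresponding slots, so the filled slots themselves can serve as the rationale for the model's decision.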