Online hate speech is a recent problem in our society that is rising at a steady pace by exploiting the vulnerabilities of the regimes that characterise most social media platforms. This phenomenon is primarily fostered by offensive comments, either during user interaction or in the form of posted multimedia content. Nowadays, giant corporations own platforms where millions of users log in every day, and protecting those users from exposure to such phenomena appears necessary in order to comply with the relevant legislation and maintain a high level of service quality. A robust and reliable system for detecting such content and preventing its upload will have a significant impact on our digitally interconnected society. Several aspects of our daily lives are undeniably linked to our social profiles, making us vulnerable to abusive behaviours. As a result, the lack of accurate hate speech detection mechanisms would severely degrade the overall user experience, while their erroneous operation would raise many ethical concerns. In this paper, we present 'ETHOS', a textual dataset with two variants, binary and multi-label, based on YouTube and Reddit comments validated using the Figure-Eight crowdsourcing platform. Furthermore, we present the annotation protocol used to create this dataset: an active sampling procedure for balancing our data in relation to the various aspects defined. Our key assumption is that, even when only a small amount of labelled data is gained from such a time-consuming process, we can guarantee hate speech occurrences in the examined material.