Abusive Language Detection in Heterogeneous Contexts: Dataset Collection and the Role of Supervised Attention

Abusive language is a massive problem in online social platforms. Existing abusive language detection techniques are particularly ill-suited to comments containing heterogeneous abusive language patterns, i.e., both abusive and non-abusive parts. This is due in part to the lack of datasets that explicitly annotate heterogeneity in abusive language. We tackle this challenge by providing an annotated dataset of abusive language in over 11,000 comments from YouTube. We account for heterogeneity in this dataset by separately annotating both the comment as a whole and the individual sentences that comprise each comment. We then propose an algorithm that uses a supervised attention mechanism to detect and categorize abusive content using multi-task learning. We empirically demonstrate the challenges of using traditional techniques on heterogeneous content and the comparative gains in performance of the proposed approach over state-of-the-art methods.
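The abstract does not spell out how the attention mechanism is supervised, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a model that produces one attention score and one abuse probability per sentence, pools the sentence probabilities into a comment-level prediction using the attention weights, and adds a penalty when the attention distribution disagrees with the sentence-level annotations. The function names, the uniform target for comments with no abusive sentence, and the weighting factor `lam` are all hypothetical.

```python
import math

def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def supervised_attention_loss(attn, sentence_labels, eps=1e-12):
    """Cross-entropy between attention weights and a target distribution
    built from sentence-level labels (1 = abusive, 0 = not abusive)."""
    n_abusive = sum(sentence_labels)
    if n_abusive == 0:
        # Assumption: with no abusive sentence, the target is uniform attention.
        target = [1.0 / len(attn)] * len(attn)
    else:
        target = [y / n_abusive for y in sentence_labels]
    return -sum(t * math.log(a + eps) for t, a in zip(target, attn))

def binary_cross_entropy(p, y, eps=1e-12):
    """Standard binary cross-entropy for one probability and one 0/1 label."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def multitask_loss(attn_scores, sentence_probs, sentence_labels,
                   comment_label, lam=0.5):
    """Comment-level loss plus a weighted attention-supervision term.
    `lam` is a hypothetical trade-off weight, not taken from the paper."""
    attn = softmax(attn_scores)
    # Comment-level probability: attention-weighted mix of sentence probabilities.
    comment_prob = sum(a * p for a, p in zip(attn, sentence_probs))
    comment_term = binary_cross_entropy(comment_prob, comment_label)
    attention_term = supervised_attention_loss(attn, sentence_labels)
    return comment_term + lam * attention_term
```

For a three-sentence comment whose second sentence is the only abusive one, attention concentrated on that sentence gives a lower loss than attention concentrated on a benign sentence, which is the behavior the sentence-level annotations are meant to encourage.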
