Detecting offensive language on social media is an important task. The ICWSM-2020 Data Challenge Task 2 is aimed at identifying offensive content using a crowd-sourced dataset containing 100k labelled tweets. The dataset, however, suffers from class imbalance, where certain labels are extremely rare compared with other classes (e.g., the hateful class makes up only 5% of the data). In this work, we present Dager (Data Augmenter), a generation-based data augmentation method that improves classification performance on imbalanced and low-resource data such as the offensive language dataset. Dager extracts the lexical features of a given class and uses these features to guide the generation of a conditional generator built on GPT-2. The generated text can then be added to the training set as augmentation data. We show that applying Dager increases the F1 score on the data challenge by 11% when only 1% of the whole dataset is used for training (with BERT as the classifier); moreover, the generated data also preserves the original labels well. We test Dager with four different classifiers (BERT, CNN, Bi-LSTM with attention, and Transformer) and observe consistent improvements in detection performance across all of them, indicating that our method is effective and classifier-agnostic.
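The abstract describes extracting class-specific lexical features to guide generation. The paper's exact extraction procedure is not given here, so the following is only an illustrative sketch of one simple way such features could be obtained: scoring words by how much more frequently they appear in a target class than in the rest of the corpus (with add-one smoothing). The corpus, function name, and scoring rule are all hypothetical.

```python
from collections import Counter

def class_keywords(texts_by_class, target, top_k=5):
    """Rank words by how much more often they occur in the target
    class than in all other classes, using a smoothed frequency ratio.
    This is an illustrative stand-in for Dager's feature extraction,
    not the paper's actual method."""
    # Word counts within the target class.
    target_counts = Counter(
        w for t in texts_by_class[target] for w in t.lower().split()
    )
    # Word counts across every other class.
    other_counts = Counter(
        w for cls, texts in texts_by_class.items() if cls != target
        for t in texts for w in t.lower().split()
    )
    # Add-one smoothing avoids division by zero for class-exclusive words.
    scores = {w: (c + 1) / (other_counts[w] + 1) for w, c in target_counts.items()}
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

# Tiny hypothetical corpus for demonstration.
corpus = {
    "hateful": ["you are awful", "awful and vile people"],
    "normal": ["have a nice day", "what a lovely day"],
}
print(class_keywords(corpus, "hateful"))  # "awful" ranks first
```

Words ranked this way could then seed or condition a GPT-2 prompt so that generated sentences stay on-label for the minority class before being added to the training set.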