We study the selection of transfer languages for automatic abusive language detection. Instead of preparing a dataset for every language, we demonstrate the effectiveness of cross-lingual transfer learning for zero-shot abusive language detection. This way we can use existing data from higher-resource languages to build better detection systems for low-resource languages. Our datasets cover seven languages from three language families. We measure the distance between the languages using several language similarity measures, in particular by quantifying features from the World Atlas of Language Structures (WALS). We show that there is a correlation between linguistic similarity and classifier performance. This finding allows us to choose an optimal transfer language for zero-shot abusive language detection.
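One common way to quantify typological distance from WALS features is a normalized Hamming distance over the features two languages share. The sketch below is a minimal illustration of that idea, not the paper's implementation; the feature IDs and values are hypothetical placeholders, not real WALS entries for these languages.

```python
def wals_distance(feats_a, feats_b):
    """Normalized Hamming distance over the WALS features both languages share.

    feats_a, feats_b: dicts mapping feature ID -> feature value.
    Returns a value in [0, 1]; smaller means typologically more similar.
    """
    shared = set(feats_a) & set(feats_b)
    if not shared:
        return 1.0  # no overlapping features: treat as maximally distant
    mismatches = sum(feats_a[f] != feats_b[f] for f in shared)
    return mismatches / len(shared)

# Hypothetical WALS-style feature assignments (placeholder values).
lang_a = {"81A": "SVO", "85A": "Prepositions", "143A": "NegV"}
lang_b = {"81A": "SOV", "85A": "Prepositions", "143A": "NegV"}

print(wals_distance(lang_a, lang_b))  # lower score = better transfer candidate
```

Under this measure, a transfer language would be chosen by computing the distance from the target language to every candidate source language and picking the closest one.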