Multilingual Offensive Language Identification for Low-resource Languages

From arXiv. Accepted to ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP). This is an extended version of a paper accepted to EMNLP (arXiv:2010.05324). arXiv admin note: substantial text overlap with arXiv:2010.05324.

Offensive content is pervasive in social media and a cause for concern for companies and government organizations. Several recent studies have investigated methods to detect the various forms of such content (e.g. hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partly because most available annotated datasets contain English data. In this paper, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions on comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report results of 0.8415 F1 macro for Bengali in the TRAC-2 shared task, 0.8532 F1 macro for Danish and 0.8701 F1 macro for Greek in OffensEval 2020, 0.8568 F1 macro for Hindi in the HASOC 2019 shared task, and 0.7513 F1 macro for Spanish in SemEval-2019 Task 5 (HatEval), showing that our approach compares favourably to the best systems submitted to recent shared tasks on these languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of the OffensEval 2020 shared task. The results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task.
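
All scores above are macro-averaged F1, which weights every class equally regardless of how often it occurs. The following is a minimal sketch of that metric in plain Python, not the scoring script of any of the shared tasks. The function name `macro_f1`, the labels `OFF`/`NOT`, and the toy data are assumptions made for illustration only and are not taken from the paper.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0

    # Count true positives, false positives, and false negatives per class.
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # missed an instance of the gold class g

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: OFF = offensive, NOT = not offensive.
gold = ["OFF", "NOT", "NOT", "OFF", "NOT", "NOT"]
pred = ["OFF", "NOT", "OFF", "NOT", "NOT", "NOT"]
score = macro_f1(gold, pred)  # F1(OFF) = 0.5, F1(NOT) = 0.75, mean = 0.625
```

Because the minority class counts as much as the majority class, macro F1 is a common choice for imbalanced data, where a classifier that always predicts the majority label would otherwise look deceptively strong.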