Offensive content is pervasive in social media and a cause of concern for companies and government organizations. Several studies investigating methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression) have been published recently. The clear majority of these studies deal with English, partly because most available annotated datasets contain English data. In this paper, we take advantage of the available English data by applying cross-lingual contextual word embeddings and transfer learning to make predictions in languages with fewer resources. We project predictions onto comparable data in Bengali, Hindi, and Spanish, obtaining macro F1 scores of 0.8415 for Bengali, 0.8568 for Hindi, and 0.7513 for Spanish. Finally, we show that our approach compares favorably to the best systems submitted to recent shared tasks on these three languages, confirming the robustness of cross-lingual contextual embeddings and transfer learning for this task.
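To make the cross-lingual transfer setting concrete, the following is a minimal sketch of the general recipe the abstract describes: fine-tune a multilingual transformer on English offensive language labels and apply it zero-shot to other languages. The specific model (xlm-roberta-base), the toy training examples, and all hyperparameters below are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: cross-lingual transfer for offensive language identification.
# Assumes a Hugging Face `transformers` XLM-R checkpoint and a toy English
# training set; the paper's actual data, model, and settings may differ.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "xlm-roberta-base"  # cross-lingual contextual embeddings
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy English training examples (label 1 = offensive, 0 = not offensive).
train_texts = ["you are an idiot", "have a nice day"]
train_labels = torch.tensor([1, 0])
enc = tokenizer(train_texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few passes over the English data
    optimizer.zero_grad()
    out = model(**enc, labels=train_labels)
    out.loss.backward()
    optimizer.step()

# Zero-shot transfer: apply the English-fine-tuned model directly to
# Bengali and Spanish text (illustrative inputs).
model.eval()
test_texts = ["তুমি একটা বোকা", "tú eres un idiota"]
with torch.no_grad():
    batch = tokenizer(test_texts, padding=True, truncation=True, return_tensors="pt")
    preds = model(**batch).logits.argmax(dim=-1)
print(preds.tolist())
```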