The goal of sexism detection is to mitigate negative online content targeting specific gender groups. However, the limited availability of labeled sexism-related datasets makes it difficult to identify online sexism in low-resource languages. In this paper, we address the task of automatic sexism detection in social media for one low-resource language -- Chinese. Rather than collecting new sexism data or building cross-lingual transfer learning models, we develop a cross-lingual domain-aware semantic specialisation system to make the most of existing data. Semantic specialisation is a technique for retrofitting pre-trained distributional word vectors by integrating external linguistic knowledge (such as lexico-semantic relations) into the specialised feature space. Specifically, we leverage semantic resources for sexism from a high-resource language (English) to specialise pre-trained word vectors in the target language (Chinese), injecting domain knowledge. We demonstrate the benefit of our sexist word embeddings (SexWEs), specialised by our framework, via intrinsic evaluation of word similarity and extrinsic evaluation of sexism detection. Compared with other specialisation approaches and Chinese baseline word vectors, our SexWEs achieve average score improvements of 0.033 and 0.064 in the intrinsic and extrinsic evaluations, respectively. Ablation results and visualisation of SexWEs further demonstrate the effectiveness of our framework in retrofitting word vectors for low-resource languages.
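To make the notion of semantic specialisation concrete, the sketch below shows a minimal retrofitting loop in the spirit of Faruqui et al. (2015): each word vector is pulled towards its neighbours in an external lexicon while staying close to its original distributional vector. This is only an illustrative assumption, not the paper's actual cross-lingual framework; the toy vectors and the tiny sexism-related lexicon are hypothetical.

```python
# Minimal illustrative sketch of semantic specialisation via retrofitting.
# NOT the paper's framework; toy vectors and lexicon are hypothetical.
import numpy as np

def retrofit(vectors, lexicon, n_iters=10, alpha=1.0, beta=1.0):
    """Pull each word vector towards its lexicon neighbours while staying
    close to the original distributional vector."""
    specialised = {w: v.copy() for w, v in vectors.items()}
    for _ in range(n_iters):
        for word, neighbours in lexicon.items():
            neighbours = [n for n in neighbours if n in specialised]
            if word not in specialised or not neighbours:
                continue
            # Weighted average of the original vector and neighbour vectors.
            num = alpha * vectors[word] + beta * sum(specialised[n] for n in neighbours)
            specialised[word] = num / (alpha + beta * len(neighbours))
    return specialised

# Toy usage with hypothetical 3-d vectors and a tiny domain lexicon.
rng = np.random.default_rng(0)
toy_vectors = {w: rng.normal(size=3) for w in ["misogyny", "sexism", "weather"]}
toy_lexicon = {"misogyny": ["sexism"], "sexism": ["misogyny"]}
specialised = retrofit(toy_vectors, toy_lexicon)
print(specialised["misogyny"])
```

In this toy run, "misogyny" and "sexism" are drawn together in the specialised space while "weather", which has no lexicon constraints, keeps its original vector.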