State-of-the-art approaches to hate-speech detection usually perform poorly in out-of-domain settings. This typically occurs because classifiers overemphasize source-specific information, which harms their domain invariance. Prior work has attempted to penalize hate-speech-related terms drawn from manually curated lists using feature-attribution methods, which quantify the importance a classifier assigns to input terms when making a prediction. We instead propose a domain adaptation approach that automatically extracts and penalizes source-specific terms using a domain classifier, which learns to differentiate between domains, together with feature-attribution scores for hate-speech classes, yielding consistent improvements in cross-domain evaluation.
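The idea of extracting source-specific terms with a domain classifier and then penalizing them can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the classifier, the attribution method, or the form of the penalty, so the sketch assumes a bag-of-words logistic regression (where a term's attribution is simply its weight times its count) and an L2 penalty on the weights of the selected terms. All function names, the toy corpus, and the placeholder tokens `forumtag` and `slur_a` are hypothetical.

```python
import math
from collections import Counter

def featurize(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

def train_logreg(texts, labels, penalized=frozenset(), lam=0.0, epochs=200, lr=0.5):
    """Binary logistic regression trained by full-batch gradient descent.

    Assumption: for a linear model, a term's attribution is weight * count,
    so an extra L2 penalty (strength lam) on the weights of `penalized`
    terms shrinks the attribution of those terms.
    """
    feats = [featurize(t) for t in texts]
    weights = {term: 0.0 for f in feats for term in f}
    bias = 0.0
    n = len(texts)
    for _ in range(epochs):
        grad_w = {term: 0.0 for term in weights}
        grad_b = 0.0
        for f, y in zip(feats, labels):
            z = bias + sum(weights[t] * c for t, c in f.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for t, c in f.items():
                grad_w[t] += (p - y) * c / n
            grad_b += (p - y) / n
        for t in weights:
            if t in penalized:
                grad_w[t] += lam * weights[t]  # attribution penalty term
            weights[t] -= lr * grad_w[t]
        bias -= lr * grad_b
    return weights, bias

def top_terms(weights, k):
    """The k terms with the largest positive weight."""
    ranked = sorted(weights, key=lambda t: weights[t], reverse=True)
    return {t for t in ranked[:k] if weights[t] > 0}

# Toy corpora (hypothetical). "forumtag" stands for a source-only artifact
# that happens to co-occur with the hate class; "slur_a" is a placeholder
# for a genuinely hateful term that also appears in the target domain.
source_texts = [
    "forumtag slur_a you people",
    "forumtag slur_a go away",
    "nice weather today",
    "good morning everyone",
]
source_labels = [1, 1, 0, 0]  # 1 = hate, 0 = non-hate
target_texts = [
    "slur_a you people",
    "slur_a go away",
    "nice weather today",
    "good morning everyone",
]

# Step 1: a domain classifier learns to tell source (1) from target (0).
domain_weights, _ = train_logreg(
    source_texts + target_texts,
    [1] * len(source_texts) + [0] * len(target_texts),
)
source_specific = top_terms(domain_weights, k=1)

# Step 2: an unpenalized hate classifier gives per-term scores for the hate class.
baseline_weights, _ = train_logreg(source_texts, source_labels)
hate_terms = top_terms(baseline_weights, k=2)

# Step 3: penalize terms that are both source-specific and hate-attributed.
to_penalize = source_specific & hate_terms
debiased_weights, _ = train_logreg(
    source_texts, source_labels, penalized=to_penalize, lam=1.0
)

print("penalized terms:", sorted(to_penalize))
print("forumtag weight:", baseline_weights["forumtag"], "->", debiased_weights["forumtag"])
```

In this toy setup the unpenalized classifier leans on `forumtag` as much as on `slur_a`, because the two always co-occur in the source data. After the penalty the model has to rely mostly on `slur_a`, the term that also appears in the target domain. A neural classifier would need a gradient-based attribution method in place of raw weights.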