The exploitation of social media to spread hate has increased tremendously over the years. Lately, multi-modal hateful content such as memes has gained more traction than uni-modal content. Moreover, memes that carry implicit content payloads are fairly challenging for existing hateful meme detection systems to detect. In this paper, we present a use case study to analyze the vulnerability of such systems to external adversarial attacks. We find that even very simple perturbations, in both uni-modal and multi-modal settings, performed by humans with little knowledge of the model can make existing detection models highly vulnerable. Empirically, we observe a noticeable performance drop of up to 10% in macro-F1 score for certain attacks. As a remedy, we attempt to boost the models' robustness using contrastive learning as well as an adversarial training-based method, VILLA. Using an ensemble of these two approaches, on two of our high-resolution datasets, we are able to regain much of the lost performance for certain attacks. We believe that our work is a first step toward addressing this crucial problem in an adversarial setting and will inspire further investigations in the future.