This work proposes a new challenge set for multimodal classification, focusing on detecting hate speech in multimodal memes. It is constructed such that unimodal models struggle and only multimodal models can succeed: difficult examples ("benign confounders") are added to the dataset to make it hard to rely on unimodal signals. The task requires subtle reasoning, yet is straightforward to evaluate as a binary classification problem. We provide baseline performance numbers for unimodal models, as well as for multimodal models with various degrees of sophistication. We find that state-of-the-art methods perform poorly compared to humans (64.73% vs. 84.7% accuracy), illustrating the difficulty of the task and highlighting the challenge that this important problem poses to the community.
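To make the role of benign confounders concrete, the following minimal sketch uses a hypothetical four-example toy set; it is not the dataset or any baseline from this work. The names `MEMES`, `best_accuracy`, `text_a`, and `image_x` are illustrative assumptions. Each text and each image appears once with a hateful label and once with a benign one, so the label depends only on the combination. The sketch computes the best accuracy reachable by any classifier restricted to a given view of the input.

```python
from collections import Counter, defaultdict

# Hypothetical toy data (not from the paper): (text, image, label), 1 = hateful.
MEMES = [
    ("text_a", "image_x", 1),
    ("text_a", "image_y", 0),  # benign confounder: same text, different image
    ("text_b", "image_x", 0),  # benign confounder: same image, different text
    ("text_b", "image_y", 1),
]


def best_accuracy(examples, key):
    """Best accuracy on `examples` for any classifier that sees only key(text, image)."""
    votes = defaultdict(Counter)
    for text, image, label in examples:
        votes[key(text, image)][label] += 1
    # For each distinct input view, the best possible guess is the majority label.
    correct = sum(counter.most_common(1)[0][1] for counter in votes.values())
    return correct / len(examples)


text_only = best_accuracy(MEMES, lambda text, image: text)
image_only = best_accuracy(MEMES, lambda text, image: image)
multimodal = best_accuracy(MEMES, lambda text, image: (text, image))

print(f"text only:  {text_only:.2f}")   # 0.50
print(f"image only: {image_only:.2f}")  # 0.50
print(f"multimodal: {multimodal:.2f}")  # 1.00
```

In this idealized setting both unimodal views are stuck at chance while the joint view is perfectly separable. The real task is far noisier, which is consistent with unimodal baselines struggling without being held exactly at chance.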