Memes are pixel-based multimedia documents that combine images with text, and the combination usually conveys a humorous meaning. Hateful memes, however, spread hatred through social networks. Automatically detecting hateful memes would help reduce their harmful societal influence. The challenge of hateful memes detection lies in its multimodal information: unlike conventional multimodal tasks, where the visual and textual information are semantically aligned, the multimodal information in a meme is weakly aligned or even irrelevant. This requires the model not only to understand the content of the meme but also to reason over the multiple modalities. In this paper, we propose a novel method that incorporates the image captioning process into the memes detection process. We conducted extensive experiments on meme datasets and demonstrated the effectiveness of our method. Our model also achieves promising results on the Hateful Memes detection challenge.