Image memes, and specifically their widely known variation, image macros, are a special new media type that combines text with images and is used in social media to playfully or subtly express humour, irony, sarcasm and even hate. It is important to accurately retrieve image memes from social media to better capture the cultural and social aspects of online phenomena and to detect potential issues (hate speech, disinformation). Essentially, the background image of an image macro is a regular image that humans easily recognize as such, but that is difficult for a machine to recognize because its feature maps are similar to those of the complete image macro. Hence, accumulating suitable feature maps in such cases can lead to a deep understanding of the notion of image memes. To this end, we propose a methodology, called Visual Part Utilization, that utilizes the visual part of image memes as instances of the regular image class and the initial image memes as instances of the image meme class to force the model to concentrate on the critical parts that characterize an image meme. Additionally, we employ a trainable attention mechanism on top of a standard ViT architecture to enhance the model's ability to focus on these critical parts and make the predictions interpretable. Several training and test scenarios involving web-scraped regular images of controlled text presence are considered for evaluating the model in terms of robustness and accuracy. The findings indicate that light visual part utilization combined with sufficient text presence during training provides the best and most robust model, surpassing the state of the art. Source code and dataset are available at https://github.com/mever-team/memetector.
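The labelling idea behind Visual Part Utilization can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_training_set`, the `vpu_fraction` parameter, the balanced-class choice and the file paths are all assumptions made for illustration. The sketch fills a chosen fraction of the regular image class with visual parts cropped from memes, takes the remainder from web-scraped regular images, and labels the original memes as the image meme class.

```python
import random
from dataclasses import dataclass

REGULAR, MEME = 0, 1  # hypothetical label encoding


@dataclass(frozen=True)
class Sample:
    path: str
    label: int


def build_training_set(memes, visual_parts, web_regular, vpu_fraction, seed=0):
    """Assemble a two-class training list (illustrative assumption only).

    memes        : paths of original image memes (labelled MEME)
    visual_parts : paths of meme backgrounds with the text removed
    web_regular  : paths of web-scraped regular images
    vpu_fraction : share of the regular class drawn from visual_parts
    """
    if not 0.0 <= vpu_fraction <= 1.0:
        raise ValueError("vpu_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    n_regular = len(memes)  # assumption: keep the two classes balanced
    n_vpu = round(vpu_fraction * n_regular)
    n_web = n_regular - n_vpu
    if n_vpu > len(visual_parts) or n_web > len(web_regular):
        raise ValueError("not enough images to satisfy the requested mix")
    # Visual parts are deliberately labelled REGULAR, so the only cue that
    # separates them from their source memes is the overlaid text.
    regular = rng.sample(visual_parts, n_vpu) + rng.sample(web_regular, n_web)
    samples = [Sample(p, MEME) for p in memes]
    samples += [Sample(p, REGULAR) for p in regular]
    rng.shuffle(samples)
    return samples


if __name__ == "__main__":
    memes = [f"memes/{i}.jpg" for i in range(10)]
    parts = [f"visual_parts/{i}.jpg" for i in range(10)]
    web = [f"web/{i}.jpg" for i in range(10)]
    for s in build_training_set(memes, parts, web, vpu_fraction=0.3):
        print(s.label, s.path)
```

A small `vpu_fraction` corresponds loosely to the "light" utilization mentioned above. The sketch does not model how much text the web-scraped regular images contain, which the abstract treats as a separately controlled factor.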