Image memes, and specifically their widely known variant, image macros, are a special new media type that combines text with images and is used in social media to playfully or subtly express humour, irony, sarcasm, and even hate. Accurately retrieving image memes from social media is important for better capturing the cultural and social aspects of online phenomena and for detecting potential issues (hate speech, disinformation). Essentially, the background image of an image macro is a regular image that humans easily recognize as such, but that a machine struggles to recognize because its feature maps are similar to those of the complete image macro. Hence, learning suitable feature maps in such cases can lead to a deep understanding of the notion of image memes. To this end, we propose a methodology that uses the visual part of image memes as instances of the regular image class and the initial image memes as instances of the image meme class, forcing the model to concentrate on the critical parts that characterize an image meme. Additionally, we employ a trainable attention mechanism on top of a standard ViT architecture to enhance the model's ability to focus on these critical parts and to make its predictions interpretable. Several training and test scenarios involving web-scraped regular images with controlled text presence are considered in terms of model robustness and accuracy. The findings indicate that light use of visual parts combined with sufficient text presence during training provides the best and most robust model, surpassing the state of the art.
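The labelling scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_training_set`, the `visual_part_fraction` parameter (standing in for "light" versus heavy visual part utilization), and the `:visual_part` identifier suffix are all hypothetical, and the step that actually removes the text from a meme to obtain its visual part is assumed to happen elsewhere. The sketch only shows how complete memes are labelled as the image meme class while their visual parts join web-scraped images in the regular image class.

```python
import random
from dataclasses import dataclass

MEME, REGULAR = 1, 0  # class labels (assumed encoding)


@dataclass(frozen=True)
class Sample:
    image_id: str
    label: int
    source: str  # "meme", "visual_part", or "regular"


def build_training_set(meme_ids, regular_ids, visual_part_fraction, seed=0):
    """Assemble labelled samples for meme vs. regular-image training.

    visual_part_fraction is a hypothetical knob: the share of memes whose
    text-free visual part is added as a regular-image instance.
    """
    if not 0.0 <= visual_part_fraction <= 1.0:
        raise ValueError("visual_part_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    meme_ids = list(meme_ids)

    # Complete image memes are instances of the image meme class.
    samples = [Sample(m, MEME, "meme") for m in meme_ids]

    # Visual parts of a subset of memes are instances of the regular class.
    n_visual = round(visual_part_fraction * len(meme_ids))
    for m in rng.sample(meme_ids, n_visual):
        samples.append(Sample(f"{m}:visual_part", REGULAR, "visual_part"))

    # Web-scraped regular images complete the regular class.
    samples.extend(Sample(r, REGULAR, "regular") for r in regular_ids)

    rng.shuffle(samples)
    return samples


if __name__ == "__main__":
    memes = [f"meme_{i}" for i in range(10)]
    regular = [f"web_{i}" for i in range(10)]
    for sample in build_training_set(memes, regular, visual_part_fraction=0.2)[:5]:
        print(sample)
```

Because a meme and its visual part differ only in the overlaid text, pairing them under opposite labels is what pushes a classifier toward the parts that characterize a meme; how many such pairs to include is left here as a free parameter.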