In the past few years, memes have become a new mode of communication on the Internet. As memes are images with embedded text, they can quickly spread hate, offence, and violence. Classifying memes is very challenging because of their multimodal nature and region-specific interpretation. A shared task was organized to develop models that can identify trolls from multimodal social media memes. This work presents the computational model we developed as part of our participation in the task. The training data comes in two forms: an image with embedded Tamil code-mixed text and an associated caption given in English. We investigated visual features using CNN, VGG16, and Inception models, and textual features using Multilingual-BERT, XLM-RoBERTa, and XLNet models. Multimodal features were extracted by combining image (CNN, ResNet50, Inception) and text (long short-term memory network) features via an early fusion approach. Results indicate that the textual approach with XLNet achieved the highest weighted $f_1$-score of $0.58$, which enabled our model to secure the $3^{rd}$ rank in this task.
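To make the early-fusion design concrete, the following is a minimal sketch (not the authors' released code) of how image features from a pretrained ResNet50 and caption features from an LSTM can be concatenated before a classification layer, as described above. The layer sizes, vocabulary handling, and two-class output are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class EarlyFusionClassifier(nn.Module):
    """Illustrative early fusion of ResNet50 image features and LSTM text features."""

    def __init__(self, vocab_size, embed_dim=128, lstm_hidden=256, num_classes=2):
        super().__init__()
        # Visual branch: pretrained ResNet50 with its final classifier removed.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # -> (B, 2048, 1, 1)
        # Textual branch: token embedding followed by an LSTM over the caption.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, lstm_hidden, batch_first=True)
        # Early fusion: concatenate both feature vectors, then classify.
        self.classifier = nn.Linear(2048 + lstm_hidden, num_classes)

    def forward(self, images, token_ids):
        img_feat = self.cnn(images).flatten(1)           # (B, 2048)
        _, (h_n, _) = self.lstm(self.embed(token_ids))   # h_n: (1, B, H)
        txt_feat = h_n[-1]                               # final hidden state, (B, H)
        fused = torch.cat([img_feat, txt_feat], dim=1)   # early fusion by concatenation
        return self.classifier(fused)
```

The key design choice shown here is that fusion happens at the feature level, before any modality-specific decision is made, so the classifier can learn interactions between visual and textual cues.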