Sarcasm detection identifies natural language expressions whose intended meaning differs from their surface meaning. It has applications in many NLP tasks such as opinion mining and sentiment analysis. Today, social media has given rise to an abundance of multimodal data in which users express their opinions through text and images. Our paper aims to leverage multimodal data to improve the performance of existing sarcasm detection systems. So far, various approaches have been proposed that use the text modality, the image modality, or a fusion of both. We propose a novel architecture that uses the RoBERTa model with a co-attention layer on top to incorporate context incongruity between the input text and image attributes. Further, we integrate feature-wise affine transformation by conditioning the input image, through FiLMed ResNet blocks, on the textual features produced by a GRU network to capture the multimodal information. The outputs of both models and the CLS token from RoBERTa are concatenated and used for the final prediction. Our results demonstrate that the proposed model outperforms the existing state-of-the-art method by a 6.14% F1 score on the public Twitter multimodal sarcasm detection dataset.
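To make the feature-wise affine transformation concrete, the sketch below shows the core FiLM conditioning operation: a per-channel scale (gamma) and shift (beta), generated from a textual feature vector, applied to an image feature map. All dimensions are illustrative assumptions, and random matrices stand in for the learned FiLM generator, the GRU text encoder, and the ResNet feature extractor; this is a minimal NumPy sketch, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- not taken from the paper.
text_dim, channels, h, w = 8, 4, 2, 2

# Stand-in image feature map from a ResNet block: (channels, H, W).
image_feats = rng.standard_normal((channels, h, w))

# Stand-in textual summary vector from a GRU encoder: (text_dim,).
text_feats = rng.standard_normal(text_dim)

# FiLM generator: two linear maps producing per-channel gamma and beta
# from the text features (random weights here stand in for learned ones).
W_gamma = rng.standard_normal((channels, text_dim)) * 0.1
W_beta = rng.standard_normal((channels, text_dim)) * 0.1
gamma = W_gamma @ text_feats   # (channels,)
beta = W_beta @ text_feats     # (channels,)

# Feature-wise affine transformation: scale and shift each channel
# of the image feature map, conditioned on the text.
filmed = gamma[:, None, None] * image_feats + beta[:, None, None]

assert filmed.shape == image_feats.shape
```

In the full model, the FiLMed feature map would pass through the remaining ResNet layers before being concatenated with the co-attention output and the RoBERTa CLS token for the final classifier.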