Sarcasm is a pervasive linguistic phenomenon that is highly challenging to explain because of its subjectivity, lack of context, and deeply felt opinion. In the multimodal setup, sarcasm is conveyed through the incongruity between the text and the visual entities. Although recent approaches treat sarcasm as a classification problem, they leave unclear why an online post is identified as sarcastic. Without a proper explanation, end users may not be able to perceive the underlying sense of irony. In this paper, we propose a novel problem, Multimodal Sarcasm Explanation (MuSE): given a multimodal sarcastic post containing an image and a caption, we aim to generate a natural language explanation that reveals the intended sarcasm. To this end, we develop MORE, a new dataset with explanations for 3510 sarcastic multimodal posts. Each explanation is a natural language (English) sentence describing the hidden irony. We benchmark MORE with a multimodal Transformer-based architecture. It incorporates cross-modal attention in the Transformer's encoder, which attends to the distinguishing features between the two modalities. Subsequently, a BART-based auto-regressive decoder is used as the generator. Empirical results show convincing improvements over various baselines (adapted for MuSE) across five evaluation metrics. We also conduct a human evaluation of the predictions and obtain a Fleiss' Kappa score of 0.4, indicating fair agreement among 25 evaluators.
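The Fleiss' Kappa score reported for the human evaluation can be illustrated with a minimal sketch of the standard definition. The abstract does not say how the score was computed, so the function name `fleiss_kappa`, the input layout, and the toy ratings below are assumptions made for illustration only; they are not the evaluation data. Under the commonly used Landis and Koch bands, values from 0.21 to 0.40 are read as fair agreement.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Share of all assignments that fell into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]

    # Observed agreement per item: fraction of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]

    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy input: 4 items, 5 raters, 3 rating categories.
toy_counts = [
    [5, 0, 0],
    [2, 3, 0],
    [1, 1, 3],
    [0, 4, 1],
]
print(round(fleiss_kappa(toy_counts), 3))
```

Each row holds the votes for one rated item, so applying the same function to a real study would only require tallying the evaluators' choices per item into this count table.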