Sarcasm generation has been investigated in previous studies by treating it as a text-to-text generation problem, i.e., generating a sarcastic sentence for an input sentence. In this paper, we study a new problem of cross-modal sarcasm generation (CMSG), i.e., generating a sarcastic description for a given image. CMSG is challenging because models must capture the characteristics of sarcasm as well as the correlation between the two modalities. In addition, there should be some inconsistency between the two modalities, which requires imagination. Moreover, high-quality paired training data is scarce. To address these problems, we take a step toward generating sarcastic descriptions from images without paired training data and propose an Extraction-Generation-Ranking based Modular method (EGRM) for cross-modal sarcasm generation. Specifically, EGRM first extracts diverse information from an image at different levels and uses the obtained image tags, sentimental descriptive caption, and commonsense-based consequence to generate candidate sarcastic texts. Then, a comprehensive ranking algorithm, which considers image-text relation, sarcasticness, and grammaticality, is proposed to select a final text from the candidate texts. Human evaluation on five criteria over a total of 1200 generated image-text pairs from eight systems, together with auxiliary automatic evaluation, shows the superiority of our method.