Multimodal target/aspect sentiment classification combines multimodal sentiment analysis and aspect/target sentiment classification. The goal of the task is to combine vision and language to understand the sentiment towards a target entity in a sentence. Twitter is an ideal setting for the task because it is inherently multimodal, highly emotional, and affects real-world events. However, multimodal tweets are short and accompanied by complex, possibly irrelevant images. We introduce a two-stream model that translates images in input space using an object-aware transformer, followed by a single-pass non-autoregressive text generation approach. We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model. Our approach increases the amount of text available to the language model and distills the object-level information in complex images. We achieve state-of-the-art performance on two multimodal Twitter datasets without modifying the internals of the language model to accept multimodal data, demonstrating the effectiveness of our translation. In addition, we explain a failure mode of a popular approach for aspect sentiment analysis when applied to tweets. Our code is available at \textcolor{blue}{\url{https://github.com/codezakh/exploiting-BERT-thru-translation}}.
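As a rough illustration of the pipeline summarized above (and not the released implementation at the repository linked in the abstract), the sketch below stubs the image-to-text translation step with a hypothetical \texttt{caption\_image} function, folds the resulting caption and the target into an auxiliary sentence, and feeds the tweet plus auxiliary sentence to an off-the-shelf BERT sentence-pair classifier from Hugging Face. The auxiliary-sentence template, the \texttt{\$T\$} target placeholder, and all file names are illustrative assumptions.

\begin{verbatim}
# Minimal sketch, not the authors' code: image -> caption ("translation"),
# caption + target -> auxiliary sentence, (tweet, auxiliary sentence) -> BERT.
import torch
from transformers import BertTokenizer, BertForSequenceClassification


def caption_image(image_path: str) -> str:
    """Placeholder for the object-aware image-to-text translation step."""
    return "a crowd of people holding signs at a rally"  # assumed output


def build_auxiliary_sentence(target: str, caption: str) -> str:
    # One possible template; the exact construction is an assumption here.
    return f"What do you think of {target} given the image shows {caption}?"


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3  # negative / neutral / positive
)

tweet = "Can't believe $T$ showed up to the rally today"   # $T$ marks the target
target = "Bernie Sanders"
aux = build_auxiliary_sentence(target, caption_image("tweet_image.jpg"))

# Sentence-pair input: the tweet as sentence A, the auxiliary sentence as B.
inputs = tokenizer(tweet.replace("$T$", target), aux, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1))  # predicted class (untrained head, for shape only)
\end{verbatim}

Because the image is reduced to text before classification, the language model itself needs no architectural changes to accept multimodal input; only the input construction differs from standard sentence-pair fine-tuning.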