Multimodal Language Analysis is a demanding area of research, since it entails two requirements: combining different modalities and capturing temporal information. In recent years, several works have been proposed in the area, mostly centered on supervised learning for downstream tasks. In this paper we propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks. Towards this end, we map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets. Extensive experimentation on Sentiment Analysis (MOSEI) and Emotion Recognition (IEMOCAP) indicates that the learned representations can achieve near-state-of-the-art performance using only a Logistic Regression algorithm for downstream classification. It is also shown that our method is extremely lightweight and can be easily generalized to other tasks and unseen data with a small performance drop and almost the same number of parameters. The proposed multimodal representation models are open-sourced and will help grow the applicability of Multimodal Language analysis.
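To make the pipeline described above concrete, the sketch below shows one plausible instantiation in PyTorch: word-aligned multimodal features are stacked into a 2-D matrix, a small Convolutional Autoencoder is trained with a reconstruction loss, and its bottleneck is later fed to a Logistic Regression classifier. All layer widths, the sequence length T, and the feature dimension D are illustrative assumptions, not the authors' released configuration.

```python
# Minimal sketch (not the authors' released code) of a convolutional autoencoder
# over word-aligned multimodal sequences stacked into 2-D matrices.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Input: (batch, 1, T, D), where T = number of aligned words and
        # D = concatenated text/audio/visual feature dimension (assumed values).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)              # bottleneck: the multimodal representation
        return self.decoder(z), z

# Reconstruction training on matrices built from the word-aligned modalities
# (hypothetical shapes: batch of 8 utterances, T=20 words, D=64 features).
model = ConvAutoencoder()
x = torch.randn(8, 1, 20, 64)
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)
loss.backward()
# The flattened bottleneck z would then be passed to a downstream classifier,
# e.g. sklearn.linear_model.LogisticRegression, as in the abstract's setup.
```

Under these assumptions, the autoencoder is trained purely on reconstruction, so the learned embeddings remain unsupervised and task-agnostic; only the lightweight Logistic Regression step sees task labels.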