Pre-trained large language models have recently achieved ground-breaking performance on a wide variety of language understanding tasks. However, the same models cannot be applied to multimodal behavior understanding tasks (e.g., video sentiment/humor detection) unless non-verbal features (e.g., acoustic and visual) are integrated with the language. Jointly modeling multiple modalities significantly increases model complexity and makes the training process data-hungry. While an enormous amount of text data is available via the web, collecting large-scale multimodal behavioral video datasets is extremely expensive in both time and money. In this paper, we investigate whether large language models alone can successfully incorporate non-verbal information when it is presented in textual form. We present a way to convert acoustic and visual information into corresponding textual descriptions and concatenate them with the spoken text. We feed this augmented input to a pre-trained BERT model and fine-tune it on three downstream multimodal tasks: sentiment, humor, and sarcasm detection. Our approach, TextMI, significantly reduces model complexity, adds interpretability to the model's decisions, and can be applied to a diverse set of tasks, while achieving superior performance on multimodal sarcasm detection and near-SOTA performance on multimodal sentiment analysis and multimodal humor detection. We propose TextMI as a general, competitive baseline for multimodal behavioral analysis tasks, particularly in low-resource settings.
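To make the input-augmentation idea concrete, the sketch below shows one way the concatenation step could look in practice: textual descriptions of the acoustic and visual cues are appended to the spoken text with separator tokens and passed to a pre-trained BERT model for fine-tuning. This is a minimal, hedged illustration only; the description strings, the separator scheme, and the classification head are assumptions for demonstration, not the paper's exact implementation.

```python
# Minimal sketch of TextMI-style input augmentation (illustrative assumptions,
# not the authors' exact pipeline).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g., binary humor or sarcasm detection
)

spoken_text = "Oh great, another Monday meeting."
# Hypothetical textual descriptions of the non-verbal (acoustic/visual) cues.
acoustic_desc = "The speaker talks in a flat, monotonous tone."
visual_desc = "The speaker rolls their eyes and smirks."

# Concatenate the spoken text with the non-verbal descriptions, using the
# tokenizer's separator token to keep the segments distinguishable.
augmented_input = (
    f"{spoken_text} {tokenizer.sep_token} {acoustic_desc} "
    f"{tokenizer.sep_token} {visual_desc}"
)

inputs = tokenizer(augmented_input, return_tensors="pt",
                   truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities (head is untrained here)
```

In an actual fine-tuning setup, the same augmented strings would be batched and trained end-to-end with a task-specific objective (e.g., cross-entropy over sentiment, humor, or sarcasm labels).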