Pre-trained large language models have recently achieved ground-breaking performance in a wide variety of language understanding tasks. However, the same model cannot be applied to multimodal behavior understanding tasks (e.g., video sentiment/humor detection) unless non-verbal features (e.g., acoustic and visual) can be integrated with language. Jointly modeling multiple modalities significantly increases the model complexity and makes the training process data-hungry. While an enormous amount of text data is available via the web, collecting large-scale multimodal behavioral video datasets is extremely expensive in terms of both time and money. In this paper, we investigate whether large language models alone can successfully incorporate non-verbal information when it is presented in textual form. We present a way to convert the acoustic and visual information into corresponding textual descriptions and concatenate them with the spoken text. We feed this augmented input to a pre-trained BERT model and fine-tune it on three downstream multimodal tasks: sentiment, humor, and sarcasm detection. Our approach, TextMI, significantly reduces model complexity, adds interpretability to the model's decisions, and can be applied to a diverse set of tasks while achieving superior (multimodal sarcasm detection) or near-SOTA (multimodal sentiment analysis and multimodal humor detection) performance. We propose TextMI as a general, competitive baseline for multimodal behavioral analysis tasks, particularly in a low-resource setting.
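To make the described input-construction step concrete, the following is a minimal sketch of the TextMI idea (not the authors' exact pipeline): verbalized acoustic and visual cues are concatenated with the spoken transcript and fed to a pre-trained BERT classifier from the Hugging Face transformers library. The example utterance, the description strings, and the use of the tokenizer's separator token are illustrative assumptions.

```python
# Minimal sketch of the TextMI-style input construction, assuming the
# Hugging Face transformers library. The transcript and the verbalized
# acoustic/visual descriptions below are hypothetical examples.
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

spoken_text = "Oh great, another Monday meeting."                    # transcript (hypothetical)
acoustic_desc = "The speaker talks in a flat, low-pitch monotone."   # verbalized acoustic cue (hypothetical)
visual_desc = "The speaker rolls their eyes and smirks."             # verbalized visual cue (hypothetical)

# Concatenate the spoken text with the textual descriptions of the
# non-verbal modalities into a single augmented input sequence.
augmented_input = (
    f"{spoken_text} {tokenizer.sep_token} {acoustic_desc} {tokenizer.sep_token} {visual_desc}"
)

inputs = tokenizer(augmented_input, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits

# In practice the model would be fine-tuned on labeled data for the
# downstream task (e.g., sarcasm detection) before this prediction is meaningful.
predicted_class = logits.argmax(dim=-1).item()
print(predicted_class)
```

In this formulation, the non-verbal signal enters the model purely as additional text, so any pre-trained language model fine-tuning recipe applies unchanged; no modality-specific encoders or fusion modules are required.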