TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models

Pre-trained large language models have recently achieved ground-breaking performance in a wide variety of language understanding tasks. However, the same model cannot be applied to multimodal behavior understanding tasks (e.g., video sentiment/humor detection) unless non-verbal features (e.g., acoustic and visual) can be integrated with language. Jointly modeling multiple modalities significantly increases model complexity and makes the training process data-hungry. While an enormous amount of text data is available via the web, collecting large-scale multimodal behavioral video datasets is extremely expensive in terms of both time and money. In this paper, we investigate whether large language models alone can successfully incorporate non-verbal information when it is presented in textual form. We present a way to convert the acoustic and visual information into corresponding textual descriptions and concatenate them with the spoken text. We feed this augmented input to a pre-trained BERT model and fine-tune it on three downstream multimodal tasks: sentiment, humor, and sarcasm detection. Our approach, TextMI, significantly reduces model complexity, adds interpretability to the model's decisions, and can be applied to a diverse set of tasks while achieving superior (multimodal sarcasm detection) or near state-of-the-art (multimodal sentiment analysis and multimodal humor detection) performance. We propose TextMI as a general, competitive baseline for multimodal behavioral analysis tasks, particularly in low-resource settings.
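To make the textualization step concrete, here is a minimal sketch of how non-verbal features might be turned into words and appended to the spoken text. The abstract does not specify the features, thresholds, or wording the authors use, so the feature names (`pitch`, `loudness`, `smile_intensity`), the low/medium/high buckets, and the helper functions below are all hypothetical and serve only as an illustration.

```python
from typing import Dict


def describe_level(value: float, low: float = 0.33, high: float = 0.66) -> str:
    """Map a feature value normalized to [0, 1] onto a coarse word.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "medium"


def textualize(features: Dict[str, float]) -> str:
    """Turn normalized non-verbal features into one English sentence."""
    if not features:
        return ""
    # Sort by name so the description is deterministic.
    parts = [
        f"{name.replace('_', ' ')} is {describe_level(value)}"
        for name, value in sorted(features.items())
    ]
    return "The speaker's " + ", ".join(parts) + "."


def build_input(
    spoken_text: str,
    acoustic: Dict[str, float],
    visual: Dict[str, float],
) -> str:
    """Concatenate the spoken text with acoustic and visual descriptions."""
    segments = [spoken_text.strip(), textualize(acoustic), textualize(visual)]
    return " ".join(segment for segment in segments if segment)


if __name__ == "__main__":
    augmented = build_input(
        "Oh great, another meeting.",
        acoustic={"pitch": 0.9, "loudness": 0.2},   # hypothetical features
        visual={"smile_intensity": 0.5},            # hypothetical feature
    )
    print(augmented)
```

The resulting string is ordinary text, so it could be tokenized and passed to a pre-trained BERT classifier for fine-tuning like any other sentence; that step is omitted here.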
