This paper investigates the effectiveness and implementation of modality-specific large-scale pre-trained encoders for multimodal sentiment analysis~(MSA). Although the effectiveness of pre-trained encoders has been reported in various fields, conventional MSA methods employ them only for the linguistic modality, and their application to the visual and acoustic modalities has not been investigated. This paper compares the features yielded by large-scale pre-trained encoders with conventional heuristic features. One of the largest publicly available pre-trained encoders is used for each modality: CLIP-ViT, WavLM, and BERT for the visual, acoustic, and linguistic modalities, respectively. Experiments on two datasets reveal that methods with domain-specific pre-trained encoders attain better performance than those with conventional features in both unimodal and multimodal scenarios. We also find that the outputs of the intermediate layers of the encoders serve as better features than those of the output layer. The code is available at https://github.com/ando-hub/MSA_Pretrain.
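To illustrate the intermediate-layer finding, the following is a minimal sketch of extracting features from a hidden layer of one such encoder (BERT), assuming the HuggingFace transformers API. The layer index and mean pooling are illustrative choices, not the paper's exact configuration; the same pattern applies to WavLM and CLIP-ViT via their respective model classes.

\begin{verbatim}
import torch
from transformers import AutoTokenizer, AutoModel

# Load BERT and request all hidden states, not only the final layer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased",
                                  output_hidden_states=True)

inputs = tokenizer("this movie was surprisingly good",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple: the embedding layer output
# followed by one tensor per transformer layer (13 entries for
# bert-base). Indexing an intermediate layer instead of the last
# one reflects the paper's finding; layer 8 here is illustrative.
intermediate = outputs.hidden_states[8]   # (batch, seq_len, hidden)
feature = intermediate.mean(dim=1)        # pool tokens to one vector
print(feature.shape)                      # torch.Size([1, 768])
\end{verbatim}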