General embeddings such as word2vec, GloVe and ELMo have shown considerable success in natural language tasks. These embeddings are typically extracted from models trained on general objectives such as skip-gram prediction and natural language generation. In this paper, we extend this work from natural language understanding to multi-modal architectures that use audio, visual and textual information for machine learning tasks. The embeddings in our network are extracted from the encoder of a transformer model trained with multi-task learning. We use person identification and automatic speech recognition as the tasks in our embedding generation framework. We tune and evaluate the embeddings on the downstream task of emotion recognition and demonstrate that, on the CMU-MOSEI dataset, the embeddings can be used to improve over previous state-of-the-art results.
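To make the described setup concrete, the following is a minimal sketch of a transformer encoder trained with two task heads, one for person identification and one for ASR, written in PyTorch. All class names, dimensions, and the assumption of pre-fused audio/visual/text input features are illustrative choices of ours, not the paper's actual architecture or hyperparameters.

import torch
import torch.nn as nn

class MultiTaskEmbedder(nn.Module):
    """Hypothetical sketch: shared transformer encoder with two task heads."""
    def __init__(self, feat_dim=128, d_model=256, n_speakers=1000, vocab_size=32):
        super().__init__()
        # Project fused audio/visual/text features into the model dimension.
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Person-identification head: one utterance-level prediction per clip.
        self.person_head = nn.Linear(d_model, n_speakers)
        # ASR head: per-frame character logits (e.g. for a CTC loss).
        self.asr_head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        h = self.encoder(self.proj(x))                    # (batch, time, d_model) embeddings
        person_logits = self.person_head(h.mean(dim=1))   # pool over time for speaker ID
        asr_logits = self.asr_head(h)                     # frame-level logits for ASR
        return h, person_logits, asr_logits

model = MultiTaskEmbedder()
x = torch.randn(2, 50, 128)                # dummy fused features: 2 clips, 50 frames
emb, pid, asr = model(x)
print(emb.shape, pid.shape, asr.shape)

In this kind of setup, the two task losses would be combined during training, and the encoder outputs emb would then be frozen or fine-tuned as embeddings for the downstream emotion-recognition task.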