In this paper, we propose a novel architecture for multi-modal speech and text input. We combine pretrained speech and text encoders using multi-headed cross-modal attention and jointly fine-tune on the target problem. The resulting architecture can be used for continuous token-level classification or utterance-level prediction over simultaneous text and speech, and its encoder efficiently captures both acoustic-prosodic and lexical information. We compare the benefits of multi-headed attention-based fusion for multi-modal utterance-level classification against a simple concatenation of pre-pooled, modality-specific representations. Our model architecture is compact, resource-efficient, and can be trained on a single consumer GPU card.
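To make the fusion idea concrete, the sketch below shows one way to combine the outputs of pretrained text and speech encoders with multi-headed cross-modal attention, followed by pooling for utterance-level prediction. It is a minimal illustration assuming PyTorch; the class names, dimensions, pooling choice, and residual layout are assumptions for exposition, not the paper's exact implementation.

```python
# Illustrative sketch of multi-headed cross-modal attention fusion.
# Names, dimensions, and pooling choices are assumptions, not the
# paper's exact implementation.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, d_model: int = 768, num_heads: int = 8):
        super().__init__()
        # Text tokens attend over speech frames, and vice versa.
        self.text_to_speech = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.speech_to_text = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm_text = nn.LayerNorm(d_model)
        self.norm_speech = nn.LayerNorm(d_model)

    def forward(self, text_feats, speech_feats):
        # text_feats:   (batch, text_len,   d_model) from a pretrained text encoder
        # speech_feats: (batch, speech_len, d_model) from a pretrained speech encoder
        t2s, _ = self.text_to_speech(query=text_feats, key=speech_feats, value=speech_feats)
        s2t, _ = self.speech_to_text(query=speech_feats, key=text_feats, value=text_feats)
        # Residual connections keep the modality-specific information intact.
        text_fused = self.norm_text(text_feats + t2s)
        speech_fused = self.norm_speech(speech_feats + s2t)
        return text_fused, speech_fused


class UtteranceClassifier(nn.Module):
    """Utterance-level head: mean-pool each fused stream and classify.
    Token-level classification would instead apply a per-token head to text_fused."""

    def __init__(self, d_model: int = 768, num_classes: int = 2):
        super().__init__()
        self.fusion = CrossModalFusion(d_model)
        self.head = nn.Linear(2 * d_model, num_classes)

    def forward(self, text_feats, speech_feats):
        text_fused, speech_fused = self.fusion(text_feats, speech_feats)
        pooled = torch.cat([text_fused.mean(dim=1), speech_fused.mean(dim=1)], dim=-1)
        return self.head(pooled)
```

The concatenation baseline mentioned in the abstract would skip the cross-attention step entirely and simply concatenate the pre-pooled text and speech representations before the classification head.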