Disfluency, though originating in human speech, has primarily been studied as a unimodal, text-based Natural Language Processing (NLP) task. In this paper, we propose a novel multimodal architecture for disfluency detection from individual utterances, built on early fusion and self-attention-based multimodal interaction between the text and acoustic modalities. Our architecture leverages a multimodal dynamic fusion network that adds minimal parameters over an existing text encoder, commonly used in prior work, to exploit the prosodic and acoustic cues hidden in speech. Through experiments, we show that our proposed model achieves state-of-the-art results on the widely used English Switchboard corpus for disfluency detection, outperforming prior unimodal and multimodal systems in the literature by a significant margin. In addition, we conduct a thorough qualitative analysis and show that, unlike text-only systems, which suffer from spurious correlations in the data, our system overcomes this problem through additional cues from the speech signal. We make all our code publicly available on GitHub.
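The early-fusion and self-attention scheme mentioned above can be illustrated with a minimal sketch. The code below is a simplified, hypothetical illustration (not the paper's actual model): each modality is linearly projected to a shared dimension, the two sequences are concatenated along the time axis (early fusion), and a single head of scaled dot-product self-attention is applied over the joint sequence, so text positions can attend to acoustic frames. All dimensions, projection matrices, and the function name are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def early_fusion_self_attention(text_tokens, audio_frames, d_model=16, seed=0):
    """Hypothetical sketch of early fusion + self-attention.

    text_tokens:  (T_text, d_text)   token embeddings
    audio_frames: (T_audio, d_audio) acoustic features (e.g., frame-level)
    Returns contextualized text representations of shape (T_text, d_model),
    suitable for token-level disfluency tagging.
    """
    rng = np.random.default_rng(seed)
    # Project each modality into a shared d_model space (random weights
    # stand in for learned parameters in this illustration).
    W_t = rng.standard_normal((text_tokens.shape[-1], d_model)) / np.sqrt(text_tokens.shape[-1])
    W_a = rng.standard_normal((audio_frames.shape[-1], d_model)) / np.sqrt(audio_frames.shape[-1])
    # Early fusion: concatenate the two modality sequences along time.
    fused = np.concatenate([text_tokens @ W_t, audio_frames @ W_a], axis=0)
    # One head of scaled dot-product self-attention over the joint sequence.
    W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    W_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    q, k, v = fused @ W_q, fused @ W_k, fused @ W_v
    attn = softmax(q @ k.T / np.sqrt(d_model))  # rows sum to 1
    out = attn @ v
    # Keep only the text positions: each now carries acoustic context.
    return out[: text_tokens.shape[0]]
```

Because the fused sequence interleaves no alignment information, a real system would typically add positional and modality embeddings before attention; the sketch omits these for brevity.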