Speech summarization is typically performed using a cascade of speech recognition and text summarization models. End-to-end modeling of speech summarization is challenging due to the memory and compute constraints arising from long input audio sequences. Recent work in document summarization has inspired methods to reduce the complexity of self-attention, which enables transformer models to handle long sequences. In this work, we introduce a single model optimized end-to-end for speech summarization. We apply the restricted self-attention technique from text-based models to speech models to address the memory and compute constraints. We demonstrate that the proposed model learns to directly summarize speech on the How2 corpus of instructional videos. The proposed end-to-end model outperforms the previously proposed cascaded model by 3 points absolute on ROUGE. Further, we consider the spoken language understanding task of predicting concepts from speech inputs and show that the proposed end-to-end model outperforms the cascade model by 4 points absolute F-1.
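The restricted self-attention mentioned above limits each position to attending only within a fixed local window, reducing the quadratic cost of full self-attention over long audio sequences. The following is a minimal illustrative sketch (not the authors' implementation); the function names, the window size, and the single-head NumPy formulation are assumptions for illustration only.

```python
import numpy as np

def restricted_attention_mask(seq_len, window):
    # Illustrative: each position may attend only to positions within
    # `window` steps of itself, so cost scales with T * window rather
    # than T^2 for sequence length T.
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def restricted_self_attention(q, k, v, window):
    # q, k, v: (T, d) arrays for a single attention head (a toy setup;
    # real models use multiple heads and learned projections).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    mask = restricted_attention_mask(q.shape[0], window)
    # Block out-of-window positions before the softmax.
    scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With a window of size w, each of the T query positions scores at most 2w + 1 keys, which is what makes long speech inputs tractable for an end-to-end transformer.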