End-to-end speech summarization (E2E SSum) generates summary sentences directly from speech. Compared with the cascade approach, which combines automatic speech recognition (ASR) and text summarization models, the E2E approach is more promising because it mitigates the effect of ASR errors, incorporates nonverbal information, and simplifies the overall system. However, because collecting a large amount of paired data (i.e., speech and summary) is difficult, the available training data is usually insufficient to train a robust E2E SSum system. In this paper, we present two novel methods that leverage a large amount of external text summarization data for E2E SSum training. The first uses a text-to-speech (TTS) system to generate synthesized speech, which is then paired with the text summary for E2E SSum training. The second is a TTS-free method that feeds a phoneme sequence, instead of synthesized speech, directly to the E2E SSum model. Experiments show that the proposed TTS- and phoneme-based methods improve several metrics on the How2 dataset. In particular, our best system outperforms the previous state-of-the-art system by a large margin, improving the METEOR score by more than 6 points. To the best of our knowledge, this is the first work to use external language resources for E2E SSum. Moreover, we report a detailed analysis of the How2 dataset to confirm the validity of our proposed E2E SSum system.