In this paper, we propose ESSumm, a novel architecture for direct extractive speech-to-speech summarization. It is an unsupervised model that does not depend on intermediate transcribed text. Unlike previous methods that operate on text representations, we aim to generate a summary directly from speech without transcription. First, a set of smaller speech segments is extracted based on the acoustic features of the speech signal. For each candidate speech segment, a distance-based summarization confidence score is designed to measure its latent speech representation. Specifically, we leverage an off-the-shelf self-supervised convolutional neural network to extract deep speech features from raw audio. Our approach automatically predicts the optimal sequence of speech segments that captures the key information within a target summary length. Extensive results on two well-known meeting datasets (the AMI and ICSI corpora) demonstrate the effectiveness of our direct speech-based method in improving summarization quality on untranscribed data. We also observe that our unsupervised speech-based method performs on par with recent transcript-based summarization approaches, which require additional speech recognition.
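To make the pipeline concrete, the following is a minimal sketch (not the authors' implementation) of the distance-based scoring and length-constrained segment selection described above. It assumes a wav2vec 2.0 encoder from torchaudio as a stand-in for the off-the-shelf self-supervised feature extractor, mean pooling over frames, and cosine distance to a recording-level reference embedding; all of these choices are illustrative assumptions, not details confirmed by the abstract.

```python
# Sketch: score candidate speech segments by the distance between their pooled
# self-supervised embeddings and a recording-level embedding, then greedily pick
# the closest segments until a target summary length is reached.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE   # assumed stand-in feature extractor
encoder = bundle.get_model().eval()


def segment_embedding(waveform: torch.Tensor) -> torch.Tensor:
    """Mean-pool the last-layer features of one mono segment of shape (1, T)."""
    with torch.inference_mode():
        features, _ = encoder.extract_features(waveform)
    return features[-1].mean(dim=1).squeeze(0)          # (hidden_dim,)


def summarize(segments: list[torch.Tensor], target_seconds: float, sample_rate: int):
    """Return indices of selected segments whose total duration fits the budget."""
    embs = torch.stack([segment_embedding(s) for s in segments])
    reference = embs.mean(dim=0)                         # recording-level representation
    distances = 1.0 - torch.nn.functional.cosine_similarity(embs, reference.unsqueeze(0))
    order = torch.argsort(distances)                     # most "central" segments first
    chosen, total = [], 0.0
    for idx in order.tolist():
        duration = segments[idx].shape[-1] / sample_rate
        if total + duration > target_seconds:
            continue
        chosen.append(idx)
        total += duration
    return sorted(chosen)                                # restore temporal order


if __name__ == "__main__":
    sr = int(bundle.sample_rate)                         # 16 kHz for this bundle
    # Stand-in candidate segments; in practice they come from acoustic segmentation.
    fake_segments = [torch.randn(1, sr * 3) for _ in range(6)]
    print(summarize(fake_segments, target_seconds=9.0, sample_rate=sr))
```

In this sketch, "central" segments (those closest to the recording-level embedding) are preferred; the actual confidence score and segment-ordering procedure in ESSumm may differ.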