As the volume of long-form spoken-word content such as podcasts explodes, many platforms seek to present short, meaningful, and logically coherent segments extracted from the full content. Such segments can be consumed by users to sample content before diving in, and can be used by the platform to promote and recommend content. However, little published work focuses on the segmentation of spoken-word content, where the errors (noise) in transcripts generated by automatic speech recognition (ASR) services pose many challenges. Here we build a novel dataset of complete transcriptions of over 400 podcast episodes, in which we label the position of the introduction in each episode. These introductions contain information about the episodes' topics, hosts, and guests, providing a valuable, author-created summary of the episode content. We further augment our dataset with word substitutions to increase the amount of available training data. We train three Transformer models based on pre-trained BERT and different augmentation strategies, which achieve significantly better performance than a static embedding model, showing that it is possible to capture generalized, larger-scale structural information from noisy, loosely organized speech data. This is further demonstrated through an analysis of the models' inner architecture. Our methods and dataset can be used to facilitate future work on the structure-based segmentation of spoken-word content.
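To make the word-substitution augmentation concrete, the following is a minimal sketch assuming a simple scheme in which each transcript token is replaced, with some probability, by a word drawn from a replacement vocabulary to simulate ASR recognition errors. The function name, the substitution probability, and the choice of vocabulary are illustrative assumptions, not the specific strategy used in the paper.

```python
import random


def augment_transcript(tokens, substitute_prob=0.1, vocabulary=None, seed=0):
    """Return a noisy copy of an ASR transcript by randomly substituting words.

    Each token is independently replaced with probability `substitute_prob`
    by a word sampled uniformly from `vocabulary` (by default, the distinct
    words of the transcript itself), mimicking word-level ASR errors.
    """
    rng = random.Random(seed)
    vocabulary = vocabulary or sorted(set(tokens))
    return [
        rng.choice(vocabulary) if rng.random() < substitute_prob else tok
        for tok in tokens
    ]


if __name__ == "__main__":
    transcript = "welcome to the show today we talk about machine learning".split()
    print(augment_transcript(transcript, substitute_prob=0.2))
```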