Filler words such as "uh" or "um" are sounds or words people use to signal they are pausing to think. Finding and removing filler words from recordings is a common and tedious task in media editing. Automatically detecting and classifying filler words could greatly aid in this task, but few studies have been published on this problem to date. A key reason is the absence of a dataset with annotated filler words for model training and evaluation. In this work, we present a novel speech dataset, PodcastFillers, with 35K annotated filler words and 50K annotations of other sounds that commonly occur in podcasts, such as breaths, laughter, and word repetitions. We propose a pipeline that leverages voice activity detection (VAD) and automatic speech recognition (ASR) to detect filler candidates, and a classifier to distinguish between filler word types. We evaluate our proposed pipeline on PodcastFillers, compare it to several baselines, and present a detailed ablation study. In particular, we evaluate the importance of using ASR and how it compares to a transcription-free approach resembling keyword spotting. We show that our pipeline obtains state-of-the-art results, and that leveraging ASR strongly outperforms a keyword spotting approach. We make PodcastFillers publicly available, in the hope that our work serves as a benchmark for future research.
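One plausible reading of the candidate-detection step can be sketched as follows. The abstract does not say how the VAD and ASR outputs are combined, so this sketch assumes that a filler candidate is any stretch the VAD marks as voiced but that no ASR word timestamp covers. The function name `find_filler_candidates`, the `min_dur` threshold, and the interval representation are all hypothetical and are not taken from the paper; the classifier stage is not shown.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def find_filler_candidates(
    voiced: List[Interval],
    words: List[Interval],
    min_dur: float = 0.2,  # hypothetical minimum candidate length in seconds
) -> List[Interval]:
    """Return voiced regions that are not covered by any transcribed word.

    Assumption (not stated in the abstract): `voiced` comes from a VAD and
    `words` holds ASR word-level timestamps, so the leftover voiced audio
    is where untranscribed sounds such as fillers may sit.
    """
    words = sorted(words)
    candidates: List[Interval] = []
    for v_start, v_end in sorted(voiced):
        cursor = v_start  # start of the part of this region not yet explained
        for w_start, w_end in words:
            if w_end <= cursor:
                continue  # word ends before the unexplained part begins
            if w_start >= v_end:
                break  # remaining words start after this voiced region
            if w_start > cursor:
                candidates.append((cursor, w_start))  # gap before this word
            cursor = max(cursor, w_end)
        if cursor < v_end:
            candidates.append((cursor, v_end))  # trailing gap after last word
    # Drop gaps too short to plausibly contain a filler.
    return [(s, e) for s, e in candidates if e - s >= min_dur]


if __name__ == "__main__":
    voiced_regions = [(0.0, 3.0)]
    asr_words = [(0.0, 1.0), (1.5, 3.0)]
    print(find_filler_candidates(voiced_regions, asr_words))
```

Under these assumptions, each returned interval would then be passed to a classifier that assigns a filler type or rejects the candidate as another sound such as a breath or laughter.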