Filler words like ``um" or ``uh" are common in spontaneous speech. It is desirable to automatically detect and remove them in recordings, as they affect the fluency, confidence, and professionalism of speech. Previous studies and our preliminary experiments reveal that the biggest challenge in filler word detection is that fillers can be easily confused with other hard categories like ``a" or ``I". In this paper, we propose a novel filler word detection method that effectively addresses this challenge by adding auxiliary categories dynamically and applying an additional inter-category focal loss. The auxiliary categories force the model to explicitly model the confusing words by mining hard categories. In addition, inter-category focal loss adaptively adjusts the penalty weight between ``filler" and ``non-filler" categories to deal with other confusing words left in the ``non-filler" category. Our system achieves the best results, with a huge improvement compared to other methods on the PodcastFillers dataset.
翻译:填充词(例如“嗯”或“啊”)在自然语言中经常出现。自动检测然后去除这些词语对于提高录音的流畅度、自信心和专业性是有益的。但相比于其他硬类别,“填充词”更难以识别。本文提出了一种新的填充词检测方法,通过动态添加辅助类别和应用额外的类别内焦点损失来解决这个问题。辅助类别会促使模型针对硬类别进行明确建模。类别内焦点损失还会自适应地调整“填充词”和“非填充词”类别之间的惩罚权重,以解决在“非填充词”类别中剩余的其他混淆词语。我们的方法在PodcastFillers数据集上取得了最佳效果,并相较于其他方法有了巨大的提升。