This paper explores the possibility of using visual object detection techniques for word localization in speech data. Object detection has been studied extensively for visual data. Because an audio signal can be interpreted as a one-dimensional image, object localization techniques can be applied to word localization. Building upon this idea, we propose a lightweight solution for word detection and localization. We use bounding box regression for word localization, which enables our model to detect the occurrence, offset, and duration of keywords in a given audio stream. We experiment with LibriSpeech and train a model to localize 1000 words. Compared to existing work, our method reduces model size by 94% and improves the F1 score by 6.5%.
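To make the one-dimensional analogy concrete, the following minimal sketch shows how a word "bounding box" on the time axis, described by its offset and duration, could be compared against a reference annotation using intersection over union (IoU). The abstract does not state the paper's loss function or matching criterion, so this is an illustration under assumptions, not the authors' method; the function name `segment_iou` and the `(offset, duration)` tuple layout are hypothetical.

```python
def segment_iou(pred, truth):
    """Intersection over union of two 1-D segments.

    Each segment is an (offset, duration) pair in seconds. This layout is
    an assumption made for illustration, not taken from the paper.
    """
    p_start, p_end = pred[0], pred[0] + pred[1]
    t_start, t_end = truth[0], truth[0] + truth[1]

    # Overlap on the time axis; zero when the segments are disjoint.
    intersection = max(0.0, min(p_end, t_end) - max(p_start, t_start))
    union = (p_end - p_start) + (t_end - t_start) - intersection
    return intersection / union if union > 0 else 0.0


# A predicted keyword at 0.0 s lasting 1.0 s versus a reference at 0.5 s lasting 1.0 s.
overlap = segment_iou((0.0, 1.0), (0.5, 1.0))
```

In the visual setting the same quantity is computed over two-dimensional boxes; here the box collapses to an interval, so a single pair of numbers is enough to describe where a keyword starts and how long it lasts.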