Detecting and localizing specific sound events within long, untrimmed documents has many important applications, including keyword spotting, medical observation, and bioacoustic monitoring for conservation. Deep learning techniques often set the state of the art for these tasks. However, for some types of events, there is insufficient labeled data to train deep learning models. In this paper, we propose novel approaches to few-shot sound event detection that utilize region proposals and the Perceiver architecture and are capable of accurately localizing sound events with very few examples of each class of interest. Motivated by a lack of suitable benchmark datasets for few-shot audio event detection, we generate and evaluate on two novel episodic rare sound event datasets: one using clips of celebrity speech as the sound event, and the other using environmental sounds. Our highest-performing proposed few-shot approaches achieve F1-scores of 0.575 and 0.672, respectively, on 5-shot 5-way tasks on these two datasets. These represent absolute improvements of 0.200 and 0.234 over strong proposal-free few-shot sound event detection baselines.