Modern noise-cancelling headphones have significantly improved users' auditory experiences by removing unwanted background noise, but they can also block out sounds that matter to users. Machine learning (ML) models for sound event detection (SED) and speaker identification (SID) can enable headphones to selectively pass through important sounds; however, implementing these models for a user-centric experience presents several unique challenges. First, most people spend limited time customizing their headphones, so the sound detection should work reasonably well out of the box. Second, the models should be able to learn over time the specific sounds that are important to users based on their implicit and explicit interactions. Finally, such models should have a small memory footprint to run on low-power headphones with limited on-chip memory. In this paper, we propose addressing these challenges using HiSSNet (Hierarchical SED and SID Network). HiSSNet is an SEID (SED and SID) model that uses a hierarchical prototypical network to detect both general and specific sounds of interest and characterize both alarm-like and speech sounds. We show that HiSSNet outperforms an SEID model trained using non-hierarchical prototypical networks by 6.9 to 8.6 percent. When compared to state-of-the-art (SOTA) models trained specifically for SED or SID alone, HiSSNet achieves similar or better performance while reducing the memory footprint required to support multiple capabilities on-device.
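To make the idea of hierarchical prototypes concrete, the following is a minimal sketch of nearest-prototype classification at two levels of a label hierarchy (a general class such as "alarm" or "speech", and a specific class beneath it). It is not the HiSSNet implementation: the two-dimensional embeddings, the `PARENT` hierarchy, the class names, and the helper functions are all hypothetical, and a real system would obtain embeddings from a trained encoder and learn them with a hierarchy-aware loss.

```python
import numpy as np

def prototypes(embeddings, labels):
    """Return (sorted class names, matrix of per-class mean embeddings)."""
    classes = sorted(set(labels))
    labels = np.asarray(labels)
    protos = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def nearest(query, classes, protos):
    """Pick the class whose prototype has the smallest squared Euclidean distance."""
    distances = ((protos - query) ** 2).sum(axis=1)
    return classes[int(np.argmin(distances))]

# Hypothetical two-level hierarchy: specific sound -> general category.
PARENT = {
    "smoke_alarm": "alarm",
    "doorbell": "alarm",
    "speaker_a": "speech",
    "speaker_b": "speech",
}

# Toy support set: stand-ins for encoder outputs (assumed, not real data).
support = np.array([
    [0.0, 0.0], [0.2, 0.0],   # smoke_alarm
    [1.0, 0.0], [1.2, 0.0],   # doorbell
    [5.0, 5.0], [5.2, 5.0],   # speaker_a
    [6.0, 5.0], [6.2, 5.0],   # speaker_b
])
fine_labels = ["smoke_alarm"] * 2 + ["doorbell"] * 2 + ["speaker_a"] * 2 + ["speaker_b"] * 2
coarse_labels = [PARENT[name] for name in fine_labels]

# One prototype set per hierarchy level, built from the same support embeddings.
fine_classes, fine_protos = prototypes(support, fine_labels)
coarse_classes, coarse_protos = prototypes(support, coarse_labels)

def classify(query):
    """Return (general class, specific class) for one query embedding."""
    return (
        nearest(query, coarse_classes, coarse_protos),
        nearest(query, fine_classes, fine_protos),
    )

print(classify(np.array([1.1, 0.1])))
print(classify(np.array([5.0, 5.1])))
```

In this sketch, registering a new user-specific sound only requires averaging a few support embeddings into a new prototype, which is one reason prototype-based approaches suit the on-device personalization setting described above.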