We consider the problem of audio voice separation for binaural applications, such as earphones and hearing aids. While today's neural networks perform remarkably well (separating $4+$ sources with 2 microphones), they assume a known or fixed maximum number of sources, $K$. Moreover, today's models are trained in a supervised manner, using training data synthesized from generic sources, environments, and human head shapes. This paper intends to relax both these constraints at the expense of a slight alteration in the problem definition. We observe that, when a received mixture contains too many sources, it is still helpful to separate them by region, i.e., to isolate the signal mixture arriving from each conical sector around the user's head. This requires learning the fine-grained spatial properties of each region, including the signal distortions imposed by the person's head. We propose a two-stage self-supervised framework in which voices overheard by the earphones are pre-processed to extract relatively clean personalized signals, which are then used to train a region-wise separation model. Results show promising performance, underscoring the importance of personalization over a generic supervised approach (audio samples are available at our project website: https://uiuc-earable-computing.github.io/binaural/). We believe these results could help real-world applications in selective hearing, noise cancellation, and audio augmented reality.
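To make the region-wise formulation concrete, the minimal sketch below illustrates the underlying idea with a simple azimuth-based grouping: sources that fall inside the same conical sector remain mixed, while different sectors are kept apart. All names (`regionwise_mixtures`, `source_signals`, `azimuths_deg`) are illustrative assumptions for this sketch, not the paper's actual pipeline or code.

```python
import numpy as np

# Minimal sketch (not the authors' implementation): group sources into
# K conical regions around the head by azimuth, and sum the sources
# that fall into each region. Sources inside one region stay mixed;
# separation happens only across regions.

def regionwise_mixtures(source_signals, azimuths_deg, num_regions=4):
    """source_signals: (num_sources, num_samples) array of clean signals.
    azimuths_deg: per-source direction of arrival in degrees, [0, 360).
    Returns a (num_regions, num_samples) array: one mixture per region."""
    num_sources, num_samples = source_signals.shape
    region_width = 360.0 / num_regions
    regions = np.zeros((num_regions, num_samples))
    for sig, az in zip(source_signals, azimuths_deg):
        idx = int((az % 360.0) // region_width)  # which conical sector
        regions[idx] += sig                      # sources in a region stay mixed
    return regions

# Example: 3 talkers, 1 second of audio at 16 kHz, 4 regions of 90 degrees.
rng = np.random.default_rng(0)
sigs = rng.standard_normal((3, 16000))
print(regionwise_mixtures(sigs, [10.0, 95.0, 100.0]).shape)  # (4, 16000)
```

In this toy example, the talkers at 95 and 100 degrees land in the same sector and are returned as one mixture, mirroring the paper's relaxed objective of separating by region rather than by individual source.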