We propose a semi-supervised approach to acoustic source localization in reverberant environments based on deep generative modeling. Localization in such environments remains an open challenge: even with large data volumes, the number of labels available for supervised learning is usually small. We address this issue by performing semi-supervised learning (SSL) with convolutional variational autoencoders (VAEs) on reverberant speech signals recorded with microphone arrays. The VAE is trained to generate the phase of relative transfer functions (RTFs) between microphones, in parallel with a direction of arrival (DOA) classifier that operates on the RTF-phase. Both models are trained using labeled and unlabeled RTF-phase sequences. In learning to perform these tasks, VAE-SSL explicitly learns to separate the physical causes of the RTF-phase (i.e., source location) from distracting signal characteristics such as noise and speech activity. Relative to existing semi-supervised localization methods in acoustics, VAE-SSL is effectively an end-to-end processing approach that relies on minimal preprocessing of RTF-phase features. To our knowledge, this paper presents the first approach to modeling the physics of acoustic propagation using deep generative modeling. We compare VAE-SSL with two signal processing-based approaches, steered response power with phase transform (SRP-PHAT) and MUltiple SIgnal Classification (MUSIC), as well as with fully supervised CNNs. We find that VAE-SSL can outperform both the conventional approaches and the CNN in label-limited scenarios. Further, the trained VAE-SSL system can generate new RTF-phase samples, which indicates that the VAE-SSL approach learns the physics of the acoustic environment. The generative modeling in VAE-SSL thus provides a means of interpreting the learned representations.
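The RTF-phase feature named in the abstract can be illustrated with a minimal sketch. It assumes a single microphone pair and estimates the RTF as the frame-averaged cross-spectrum divided by the reference auto-spectrum; this estimator, the frame parameters, and the names `stft`, `rtf_phase`, and `demo` are illustrative assumptions, not the preprocessing specified by the paper.

```python
import numpy as np


def stft(x, n_fft=256, hop=128):
    """Hann-windowed STFT; returns an array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def rtf_phase(ref, other, n_fft=256, hop=128, eps=1e-12):
    """Phase (radians) of an estimated RTF from `ref` to `other`, per frequency bin.

    Assumed estimator: mean cross-spectrum over mean reference auto-spectrum.
    """
    X_ref = stft(ref, n_fft, hop)
    X_oth = stft(other, n_fft, hop)
    cross = np.mean(X_oth * np.conj(X_ref), axis=0)  # cross-spectrum estimate
    auto = np.mean(np.abs(X_ref) ** 2, axis=0)       # reference auto-spectrum
    rtf = cross / (auto + eps)
    return np.angle(rtf)


def demo(delay=3, n=16000, seed=0):
    """Toy anechoic case: the second channel is the first, delayed by `delay` samples."""
    rng = np.random.default_rng(seed)
    source = rng.standard_normal(n)
    delayed = np.roll(source, delay)  # circular shift stands in for propagation delay
    return rtf_phase(source, delayed)


if __name__ == "__main__":
    phase = demo()
    print(phase[:5])
```

In this toy case the phase is close to a wrapped linear function of frequency whose slope is set by the inter-microphone delay, which is the dependence on source direction that a DOA classifier can exploit. In a reverberant room, reflections perturb this simple pattern.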