Recently, research on ad-hoc microphone arrays with deep learning has drawn much attention, especially in speech enhancement and separation. Because an ad-hoc microphone array may cover a large area, multiple speakers may be located far apart and talk independently; target-dependent speech separation, which aims to extract a target speaker from mixed speech, is therefore important for extracting and tracking a specific speaker with such an array. However, this technique has not been explored yet. In this paper, we propose deep ad-hoc beamforming based on speaker extraction, which is to our knowledge the first work on target-dependent speech separation based on ad-hoc microphone arrays and deep learning. The algorithm contains the following components. First, we propose a supervised channel-selection framework based on speaker extraction, where the estimated utterance-level signal-to-noise ratios (SNRs) of the target speech are used as the basis for channel selection. Second, we apply the selected channels to a deep-learning-based minimum variance distortionless response (MVDR) algorithm, where a single-channel speaker-extraction algorithm is applied to each selected channel to estimate the mask of the target speech. We conducted extensive experiments on a WSJ0-adhoc corpus. Experimental results demonstrate the effectiveness of the proposed method.
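The two components above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the SNR values, the single-frequency-bin MVDR formulation, and the choice of reference channel are all illustrative assumptions. It shows (a) ranking channels by estimated utterance-level SNR and keeping the top k, and (b) computing mask-based MVDR weights from speech and noise covariance matrices estimated with a time-frequency mask.

```python
import numpy as np

def select_channels(snr_estimates, k):
    """Rank channels by estimated utterance-level SNR and keep the top k.

    snr_estimates: (channels,) array of per-channel SNR estimates (dB).
    Returns the indices of the k highest-SNR channels, in channel order.
    """
    order = np.argsort(snr_estimates)[::-1]   # descending SNR
    return np.sort(order[:k])

def mvdr_weights(Y, mask, ref_channel=0):
    """Mask-based MVDR weights for a single frequency bin.

    Y:    (channels, frames) complex STFT coefficients at one frequency.
    mask: (frames,) estimated target-speech mask in [0, 1].
    """
    speech = mask * Y                          # masked target estimate
    noise = (1.0 - mask) * Y                   # masked interference estimate
    phi_ss = speech @ speech.conj().T / max(mask.sum(), 1e-8)
    phi_nn = noise @ noise.conj().T / max((1.0 - mask).sum(), 1e-8)
    phi_nn += 1e-6 * np.eye(Y.shape[0])        # diagonal loading for stability
    num = np.linalg.solve(phi_nn, phi_ss)      # Phi_nn^{-1} Phi_ss
    u = np.zeros(Y.shape[0])
    u[ref_channel] = 1.0                       # one-hot reference-channel vector
    return num @ u / np.trace(num)

# Channel-selection demo: channels 1 and 3 have the highest estimated SNRs.
snrs = np.array([3.0, 12.5, -1.0, 8.2])
selected = select_channels(snrs, 2)

# Beamforming demo on synthetic data for the selected channels.
rng = np.random.default_rng(0)
Y = rng.standard_normal((4, 50)) + 1j * rng.standard_normal((4, 50))
mask = rng.random(50)
w = mvdr_weights(Y[selected], mask)
enhanced = w.conj() @ Y[selected]              # (frames,) beamformed output
```

In a full system this per-bin computation would run over all frequency bins of the STFT, with the mask produced by the single-channel speaker-extraction network on each selected channel.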