Beamforming for multichannel speech enhancement relies on the estimation of spatial characteristics of the acoustic scene. In its simplest form, the delay-and-sum beamformer (DSB) applies a time delay to each channel to align the desired signal components for constructive superposition. Recent investigations of neural spatiospectral filtering revealed that these filters can be characterized by a beampattern similar to that of traditional beamformers, which shows that artificial neural networks can learn and explicitly represent spatial structure. Using the Complex-valued Spatial Autoencoder (COSPA) as an exemplary neural spatiospectral filter for multichannel speech enhancement, we investigate where and how such networks represent spatial information. We show via clustering that for COSPA the spatial information is represented by the features generated by a gated recurrent unit (GRU) layer that has access to all channels simultaneously, and that these features depend not on the source signal but only on its direction of arrival.
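For reference, the sentence on the DSB can be made concrete with a minimal frequency-domain sketch, assuming a uniform linear array, far-field plane-wave propagation, and a known direction of arrival; the function name, array layout, and parameters are illustrative and not taken from the paper.

```python
# Minimal sketch of a frequency-domain delay-and-sum beamformer (DSB).
# Assumptions (not from the paper): uniform linear array, far-field plane
# wave, known direction of arrival; all names here are hypothetical.
import numpy as np

def delay_and_sum(stft, mic_positions, doa_rad, fs, n_fft, c=343.0):
    """Align and average multichannel STFT frames for one look direction.

    stft          : complex array, shape (n_mics, n_freq, n_frames)
    mic_positions : array, shape (n_mics,), positions along the array axis [m]
    doa_rad       : direction of arrival relative to broadside [rad]
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)              # (n_freq,)
    # Per-microphone time delays of a plane wave arriving from doa_rad.
    delays = mic_positions * np.sin(doa_rad) / c             # (n_mics,)
    # Steering phases that compensate these delays per frequency bin.
    steering = np.exp(1j * 2 * np.pi * freqs[None, :] * delays[:, None])
    # Delay (phase-align) each channel, then sum/average so the desired
    # signal components superimpose constructively.
    return np.mean(np.conj(steering)[:, :, None] * stft, axis=0)
```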