Developing microphone array technologies for a small number of microphones is important due to the constraints of many devices. One way to address this situation is to virtually augment the number of microphone signals, e.g., based on several physical model assumptions. However, such assumptions are not necessarily met in realistic conditions. In this paper, as an alternative approach, we propose a neural network-based virtual microphone estimator (NN-VME). The NN-VME estimates virtual microphone signals directly in the time domain, exploiting the precise estimation capability of recent time-domain neural networks. We adopt a fully supervised learning framework that uses actual observations at the locations of the virtual microphones as training targets. Consequently, the NN-VME can be trained using only multi-channel observations, and thus directly on real recordings, avoiding the need for unrealistic physical model-based assumptions. Experiments on the CHiME-4 corpus show that the proposed NN-VME achieves high virtual microphone estimation performance even for real recordings, and that a beamformer augmented with the NN-VME improves both speech enhancement and recognition performance.
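The supervised setup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`make_training_pair`, `snr_db`, `average_baseline`), the channel layout, and the choice of a time-domain SNR as the quality measure are all assumptions made for illustration. The idea shown is only that a multi-channel recording supplies its own training target: some channels act as network inputs, and a held-out channel at the virtual microphone position acts as the reference signal.

```python
import numpy as np


def make_training_pair(recording, input_channels, target_channel):
    """Split a (channels, samples) recording into inputs and a supervised target.

    Hypothetical helper: the held-out channel plays the role of the
    "actual observation at the location of the virtual microphone".
    """
    recording = np.asarray(recording, dtype=np.float64)
    inputs = recording[list(input_channels)]   # shape: (n_inputs, samples)
    target = recording[target_channel]         # shape: (samples,)
    return inputs, target


def snr_db(estimate, target, eps=1e-12):
    """Time-domain signal-to-noise ratio in dB (assumed quality measure)."""
    error = target - estimate
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(error ** 2) + eps))


def average_baseline(inputs):
    """Placeholder estimator: the mean of the input channels.

    A trained time-domain network would take the place of this function.
    """
    return inputs.mean(axis=0)


if __name__ == "__main__":
    # Toy 3-channel "recording": one sinusoid with a different phase per channel.
    t = np.arange(1000)
    recording = np.stack([np.sin(0.01 * t + d) for d in (0.0, 0.1, 0.2)])

    # Channels 0 and 2 are the inputs; channel 1 is the virtual microphone target.
    inputs, target = make_training_pair(recording, [0, 2], 1)
    estimate = average_baseline(inputs)
    print(f"baseline SNR: {snr_db(estimate, target):.1f} dB")
```

In a real training loop, the placeholder estimator would be replaced by a network whose parameters are updated to improve the chosen time-domain objective against the held-out channel. No simulated room or physical propagation model is needed to build the target.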