Beamforming is a powerful tool designed to enhance speech signals from the direction of a target source. Computing the beamforming filter requires estimating spatial covariance matrices (SCMs) of the source and noise signals. Time-frequency masks are often used to compute these SCMs. Most studies of mask-based beamforming have assumed that the sources do not move. However, sources often move in practice, which causes performance degradation. In this paper, we address the problem of mask-based beamforming for moving sources. We first review classical approaches to tracking a moving source, which perform online or blockwise computation of the SCMs. We show that these approaches can be interpreted as computing a sum of instantaneous SCMs weighted by attention weights. These weights indicate which time frames of the signal to consider in the SCM computation. Online or blockwise computation assumes a heuristic and deterministic way of computing these attention weights that, although simple, may not result in optimal performance. We thus introduce a learning-based framework that computes optimal attention weights for beamforming. We achieve this using a neural network implemented with self-attention layers. We show experimentally that our proposed framework can greatly improve beamforming performance in moving source situations while maintaining high performance in non-moving situations, thus enabling the development of mask-based beamformers robust to source movements.
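To make the attention-weighted SCM view concrete, here is a minimal NumPy sketch, not the paper's implementation: it forms mask-weighted instantaneous SCMs, combines them with an attention-weight matrix over time frames, and passes the result to a standard MVDR-style filter. The function names, array shapes, and the choice of the Souden-style MVDR formula are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def attention_weighted_scm(X, mask, attention):
    """Attention-weighted sum of instantaneous SCMs.

    X         : (T, F, M) complex STFT of the M-channel mixture
    mask      : (T, F)    time-frequency mask for the target (or noise)
    attention : (T, T)    attention weights; row t selects which frames
                          contribute to the SCM used at frame t
    Returns   : (T, F, M, M) per-frame SCM estimates
    """
    # Mask-weighted instantaneous (rank-1) SCMs: m_tf * x_tf x_tf^H
    inst = mask[..., None, None] * np.einsum('tfm,tfn->tfmn', X, X.conj())
    # Weighted sum over frames. A causal, exponentially decaying
    # lower-triangular `attention` matrix mimics online recursive
    # averaging; a block-diagonal uniform matrix mimics blockwise
    # averaging. A learned matrix replaces these heuristics.
    return np.einsum('ts,sfmn->tfmn', attention, inst)

def mvdr_filter(scm_s, scm_n, ref=0):
    """One common beamformer (reference-channel MVDR) as an example of how
    target/noise SCMs of shape (F, M, M) are turned into a filter (F, M)."""
    num = np.linalg.solve(scm_n, scm_s)  # Phi_n^{-1} Phi_s, shape (F, M, M)
    denom = np.trace(num, axis1=-2, axis2=-1)[..., None] + 1e-9
    return num[..., ref] / denom
```

As a usage note, under these assumptions the filter at frame t would be obtained by calling `mvdr_filter` on the target and noise SCMs returned by `attention_weighted_scm` for that frame, so the only difference between the classical trackers and the learned variant is how the `attention` matrix is produced.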