We present a neural network that renders binaural speech from monaural audio, given the position and orientation of the source. Most previous work has focused on synthesizing binaural speech by conditioning on source position and orientation in the feature space of convolutional neural networks. These synthesis approaches are powerful at estimating the target binaural speech even for in-the-wild data, but are difficult to generalize to rendering audio from out-of-distribution domains. To alleviate this, we propose Neural Fourier Shift (NFS), a novel network architecture that enables binaural speech rendering in the Fourier space. Specifically, building on the geometric time delay determined by the distance between the source and the receiver, NFS is trained to predict the delays and scales of various early reflections. By design, NFS is efficient in both memory and computation, is interpretable, and operates independently of the source domain. Experimental results show that NFS outperforms previous studies on the benchmark dataset, with up to 25 times less memory and 6 times fewer computations.
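As a rough illustration of the core idea (not the paper's implementation), the sketch below shows how a time delay becomes a multiplication by a phase ramp in Fourier space, and how scaled, delayed copies of a mono source can be summed to model early reflections for one ear. The function names and the `delays`/`scales` values are hypothetical stand-ins for the quantities NFS would predict.

```python
# Minimal sketch of a Fourier-space time shift, assuming NFS-style
# delay/scale predictions per early reflection (values here are made up).
import numpy as np

def fourier_shift(x, delay_samples):
    """Delay x by a (possibly fractional) number of samples via the
    Fourier shift theorem: X(f) -> X(f) * exp(-2j*pi*f*delay)."""
    n = len(x)
    freqs = np.fft.rfftfreq(n)        # normalized frequency, cycles/sample
    X = np.fft.rfft(x)
    return np.fft.irfft(X * np.exp(-2j * np.pi * freqs * delay_samples), n=n)

def render_ear(mono, delays, scales):
    """Sum scaled, delayed copies of the mono source, one per reflection."""
    return sum(a * fourier_shift(mono, d) for a, d in zip(scales, delays))

# Usage: render one ear from a mono signal with three early reflections.
rng = np.random.default_rng(0)
mono = rng.standard_normal(16000)     # 1 s of audio at 16 kHz
delays = [12.0, 37.5, 80.25]          # delays in samples (fractional allowed)
scales = [1.0, 0.4, 0.15]             # attenuation per reflection
left = render_ear(mono, delays, scales)
```

Because the shift is applied as a phase ramp rather than an integer sample offset, sub-sample delays come for free, which is one reason operating in the Fourier space is attractive for this task.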