We present the first neural network model to achieve real-time and streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions for processing large receptive fields in a computationally efficient manner while also leveraging the generalization performance of transformer-based architectures. Our evaluations show as much as 2.2-3.3 dB improvement in SI-SNRi compared to the prior models for this task while having a 1.2-4x smaller model size and a 1.5-2x lower runtime. We provide code, dataset, and audio samples: https://waveformer.cs.washington.edu/.
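To illustrate why dilated causal convolutions suit streaming operation, here is a minimal NumPy sketch (not the authors' implementation, and the function names are hypothetical): a causal dilated 1D convolution whose output at time t depends only on past and present samples, plus the standard receptive-field formula for a stack of such layers, which grows linearly in the sum of the dilation rates rather than in the depth alone.

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """Causal dilated 1D convolution: y[t] = sum_k w[k] * x[t - k*dilation].
    Left-padding with zeros keeps the output the same length as the input
    and guarantees y[t] never reads a future sample."""
    T, K = len(x), len(w)
    pad = (K - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # zero left-pad for causality
    y = np.zeros(T)
    for t in range(T):
        for k in range(K):
            y[t] += w[k] * xp[t + pad - k * dilation]
    return y

def receptive_field(kernel_size, dilations):
    """Look-back (in samples) of a stack of dilated causal conv layers,
    one layer per entry in `dilations`."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Exponentially increasing dilations give a large receptive field cheaply:
# four layers with kernel 3 and dilations 1, 2, 4, 8 already see 31 samples.
print(receptive_field(3, [1, 2, 4, 8]))
```

Causality is what makes the encoder streamable: perturbing a sample at time t changes the output only at times >= t, so past outputs never need to be recomputed as new audio arrives.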