Music source separation requires a large input field to model the long-term dependencies of an audio signal. Previous convolutional neural network (CNN)-based approaches address large-input-field modeling by sequentially down- and up-sampling feature maps, or by dilated convolution. In this paper, we argue for the importance of rapidly growing the receptive field and of simultaneously modeling multi-resolution data within a single convolution layer, and propose a novel CNN architecture called densely connected dilated DenseNet (D3Net). D3Net introduces a novel multi-dilated convolution that applies different dilation factors within a single layer to model multiple resolutions simultaneously. By combining the multi-dilated convolution with the DenseNet architecture, D3Net avoids the aliasing problem that arises when dilated convolution is naively incorporated into DenseNet. Experimental results on the MUSDB18 dataset show that D3Net achieves state-of-the-art performance with an average signal-to-distortion ratio (SDR) of 6.01 dB.
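To make the multi-dilated convolution concrete, here is a minimal sketch in PyTorch, not the authors' implementation: the class name `MultiDilatedConv2d`, the even channel split, and the fixed dilation set `(1, 2, 4)` are illustrative assumptions (in D3Net, the dilation factor of each channel group is instead tied to the dense-block skip connection it comes from).

```python
import torch
import torch.nn as nn


class MultiDilatedConv2d(nn.Module):
    """Sketch of a multi-dilated convolution: one layer, several dilations."""

    def __init__(self, in_channels, out_channels, dilations=(1, 2, 4)):
        super().__init__()
        # Split the input channels as evenly as possible across dilation factors.
        splits = [in_channels // len(dilations)] * len(dilations)
        splits[0] += in_channels - sum(splits)
        self.splits = splits
        # One 3x3 convolution per dilation factor; padding = dilation keeps
        # the spatial size unchanged while the receptive field grows.
        self.convs = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=3, padding=d, dilation=d)
            for c, d in zip(splits, dilations)
        )

    def forward(self, x):
        # Filter each channel group at its own resolution, then sum the
        # responses so a single layer models multiple resolutions at once.
        groups = torch.split(x, self.splits, dim=1)
        return sum(conv(g) for conv, g in zip(self.convs, groups))


# Toy spectrogram-shaped input: (batch, channels, frequency, time).
x = torch.randn(1, 12, 64, 128)
y = MultiDilatedConv2d(12, 32)(x)
print(y.shape)  # torch.Size([1, 32, 64, 128])
```

The intuition behind the abstract's aliasing claim is that channel groups carrying little context (e.g., the raw block input) should receive small dilations, while groups from deeper layers, whose receptive fields already cover the intermediate samples, can safely receive larger ones; summing the per-dilation responses also keeps the output channel count independent of the number of dilation factors.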