Sounds, especially music, contain harmonic components spread across the frequency dimension. It is difficult for ordinary convolutional neural networks to observe these overtones. This paper introduces a multiple rates dilated causal convolution (MRDC-Conv) method to efficiently capture the harmonic structure in logarithmic-scale spectrograms. Harmonic structure is helpful for pitch estimation, which is important for many sound processing applications. We propose HarmoF0, a fully convolutional network, to evaluate MRDC-Conv and other dilated convolutions in pitch estimation. The results show that this model outperforms DeepF0, achieves state-of-the-art performance on three datasets, and at the same time reduces the number of parameters by more than 90%. We also find that it has stronger noise resistance and makes fewer octave errors.
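As a rough illustration of why several dilation rates suit a logarithmic frequency axis: with Q bins per octave, the k-th harmonic of a tone lies about Q·log2(k) bins above its fundamental, whatever the pitch, so each harmonic needs its own fixed offset. The NumPy sketch below is not the HarmoF0 implementation. The function names `harmonic_offsets` and `harmonic_sum`, the uniform weights, and the choice to look only upward from each bin are assumptions made for this example; a trained layer would learn its weights.

```python
import numpy as np


def harmonic_offsets(n_harmonics, bins_per_octave):
    """Bin offsets of harmonics 1..n_harmonics above the fundamental
    on a log-frequency axis (hypothetical helper, not from the paper)."""
    k = np.arange(1, n_harmonics + 1)
    return np.round(bins_per_octave * np.log2(k)).astype(int)


def harmonic_sum(spec, offsets, weights=None):
    """Weighted sum of each bin with the bins at the given offsets above it.

    spec has shape (n_bins, n_frames). Each offset acts as one tap with its
    own dilation rate; bins past the top of the spectrogram count as zero.
    """
    n_bins = spec.shape[0]
    if weights is None:
        # Assumption: uniform weights stand in for learned ones.
        weights = np.ones(len(offsets))
    out = np.zeros(spec.shape, dtype=float)
    for w, d in zip(weights, offsets):
        if d >= n_bins:
            continue  # this tap falls entirely outside the spectrogram
        out[: n_bins - d] += w * spec[d:]
    return out


# Toy spectrogram: one frame, a tone whose fundamental sits at bin 30.
offsets = harmonic_offsets(4, bins_per_octave=48)  # [0, 48, 76, 96]
spec = np.zeros((200, 1))
spec[30 + offsets, 0] = 1.0
salience = harmonic_sum(spec, offsets)
estimated_bin = int(salience[:, 0].argmax())  # 30
```

In this toy case only the true fundamental lines up with all four taps, so it receives the highest score, while a bin such as the second harmonic matches only some of them.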