Sounds, especially music, contain harmonic components spread along the frequency dimension. Standard convolutional neural networks struggle to capture these overtones. This paper introduces a multiple rates dilated causal convolution (MRDC-Conv) method that efficiently captures the harmonic structure in logarithmic-scale spectrograms. Harmonic structure is helpful for pitch estimation, which is important for many sound processing applications. We propose HarmoF0, a fully convolutional network, to evaluate MRDC-Conv and other dilated convolutions on pitch estimation. The results show that this model outperforms DeepF0, achieves state-of-the-art performance on three datasets, and at the same time reduces the number of parameters by more than 90%. We also find that it is more robust to noise and makes fewer octave errors. The code and pre-trained model are available at https://github.com/WX-Wei/HarmoF0.
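To illustrate why a logarithmic frequency axis suits dilated convolutions, the following minimal sketch computes where the harmonics of a tone fall on such an axis. With Q bins per octave, the k-th harmonic lies round(Q * log2(k)) bins above the fundamental, whatever the pitch, so a fixed set of bin offsets can reach all overtones at once. This is not the authors' MRDC-Conv implementation: the function names `harmonic_bin_offsets` and `harmonic_sum`, the choice of 48 bins per octave, and the unit weights (in place of learned ones) are assumptions made for illustration only.

```python
import math


def harmonic_bin_offsets(bins_per_octave, num_harmonics):
    """Bin offsets of harmonics 1..num_harmonics above the fundamental
    on a log-frequency axis (hypothetical helper, not from the paper)."""
    return [round(bins_per_octave * math.log2(k))
            for k in range(1, num_harmonics + 1)]


def harmonic_sum(spectrum, offsets):
    """For each bin, add the energy at that bin plus each harmonic offset.
    Offsets that run past the top of the spectrum are skipped.
    Unit weights stand in for the learned weights of a real network."""
    n = len(spectrum)
    return [sum(spectrum[i + d] for d in offsets if i + d < n)
            for i in range(n)]


# Assumed resolution: 48 bins per octave (illustrative, not the paper's setting).
offsets = harmonic_bin_offsets(bins_per_octave=48, num_harmonics=6)
print(offsets)  # [0, 48, 76, 96, 111, 124]

# Toy spectrum: a tone whose fundamental sits at bin 10, with six harmonics.
spectrum = [0.0] * 200
for d in offsets:
    spectrum[10 + d] = 1.0

response = harmonic_sum(spectrum, offsets)
peak_bin = response.index(max(response))
print(peak_bin, response[peak_bin])  # 10 6.0
```

Because the offsets do not depend on the fundamental, the same pattern applies at every bin, which is what makes a convolution with several dilation rates a natural fit. Bin 58 (one octave above the fundamental) also collects some energy in this toy example, a small picture of how octave confusions arise.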