Deep learning approaches have brought advances to beat and downbeat tracking. However, these approaches continue to rely on hand-crafted, subsampled spectral features as input, restricting the information available to the model. In this work, we propose WaveBeat, an end-to-end approach for joint beat and downbeat tracking that operates directly on waveforms. This method forgoes engineered spectral features and instead produces beat and downbeat predictions directly from the waveform, the first of its kind for this task. Our model uses temporal convolutional networks (TCNs) that achieve a very large receptive field ($\geq$ 30 s) at audio sample rates in a memory-efficient manner by employing rapidly growing dilation factors with fewer layers. Combined with a straightforward data augmentation strategy, our method outperforms previous state-of-the-art methods on some datasets while producing comparable results on others, demonstrating the potential of time-domain approaches.
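To make the receptive-field claim concrete, the sketch below works through the arithmetic for stacked stride-1 dilated 1-D convolutions, where each layer adds $(k-1)\,d_l$ samples to the receptive field. The kernel size, dilation growth factor, layer count, and 22.05 kHz sample rate are illustrative assumptions, not values taken from the paper.

```python
def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Receptive field, in samples, of stacked stride-1 dilated 1-D convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer widens the field by (k-1)*d samples
    return rf

# Hypothetical configuration (not from the paper): kernel size 15,
# dilations growing by a factor of 8 per layer, 7 layers.
dilations = [8 ** i for i in range(7)]  # 1, 8, 64, ..., 262144
rf = receptive_field(15, dilations)
print(f"{rf} samples = {rf / 22050:.1f} s at 22.05 kHz")
# -> 4194303 samples = 190.2 s, well beyond the 30 s target, versus the
#    16 layers a merely doubling dilation schedule would need to reach 30 s.
```

The point of the rapid growth factor is visible in the last comment: reaching a 30 s receptive field at audio sample rates with conventional doubling dilations would require roughly twice as many layers, with a corresponding increase in memory and compute.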