The human ability to track musical downbeats is robust to changes in tempo and extends to tempi never previously encountered. We propose a deterministic time-warping operation that enables this skill in a convolutional neural network (CNN) by allowing the network to learn rhythmic patterns independently of tempo. Unlike conventional deep learning approaches, which learn rhythmic patterns at the tempi present in the training data, our model learns tempo-invariant patterns, leading to better tempo generalisation and more efficient use of network capacity. We test this generalisation property on a synthetic dataset created by rendering the Groove MIDI Dataset with FluidSynth, split into a training set containing the original performances and a test set containing tempo-scaled versions rendered with different SoundFonts (test-time augmentation). The proposed model generalises nearly perfectly to unseen tempi (F-measure of 0.89 on both training and test sets), whereas a comparable conventional CNN achieves similar accuracy only on the training set (0.89) and drops to 0.54 on the test set. The generalisation advantage of the proposed model extends to real music, as shown by results on the GTZAN and Ballroom datasets.
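The abstract does not spell out the warping operation itself; as a rough illustration of the general idea only, the sketch below shows one way a deterministic, tempo-conditioned warp could normalise the time axis so that a beat always spans a fixed number of frames before a CNN processes the input. The function name `tempo_warp`, its parameters, and the use of linear interpolation are assumptions made here for illustration and are not taken from the paper.

```python
import numpy as np

def tempo_warp(features, tempo_bpm, frame_rate=100.0, frames_per_beat=16):
    """Illustrative tempo normalisation (hypothetical, not the paper's method):
    resample the time axis of `features` (shape: time x channels, e.g. an
    onset-strength or spectrogram representation) so that one beat always
    covers `frames_per_beat` output frames, regardless of the input tempo."""
    frames_per_beat_in = frame_rate * 60.0 / tempo_bpm      # input frames per beat
    scale = frames_per_beat / frames_per_beat_in            # deterministic warp factor
    n_out = int(round(features.shape[0] * scale))
    src = np.arange(features.shape[0])
    dst = np.linspace(0.0, features.shape[0] - 1, n_out)
    # Linear interpolation per channel applies the same warp to every feature.
    return np.stack(
        [np.interp(dst, src, features[:, c]) for c in range(features.shape[1])],
        axis=1,
    )
```

Under this reading, two renditions of the same rhythm at different tempi map to (approximately) the same warped representation, so a convolutional kernel trained on one tempo applies unchanged to the other.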