We propose Beat Transformer, a novel Transformer encoder architecture for joint beat and downbeat tracking. Unlike previous models that track beats solely from the spectrogram of an audio mixture, our model operates on demixed spectrograms with multiple instrument channels. This is inspired by the fact that humans perceive metrical structures from richer musical contexts, such as chord progression and instrumentation. To this end, we develop a Transformer model with both time-wise attention and instrument-wise attention to capture deeply buried metrical cues. Moreover, our model adopts a novel dilated self-attention mechanism, which achieves powerful hierarchical modelling with only linear complexity. Experiments demonstrate a significant improvement in demixed beat tracking over the non-demixed version. In addition, Beat Transformer improves downbeat tracking accuracy by up to 4 percentage points over TCN-based architectures. We further discover an interpretable attention pattern that mirrors our understanding of hierarchical metrical structures.
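To illustrate the linear-complexity idea behind dilated self-attention, the following is a minimal, single-head sketch in which each time step attends only to a small window of neighbours spaced `dilation` frames apart. The function name, the `window`/`dilation` parameters, and the gather-based implementation are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of dilated self-attention (single head).
# Each frame attends to 2*window+1 neighbours spaced `dilation` frames apart,
# so the cost per layer grows linearly with the sequence length T.
import torch
import torch.nn.functional as F

def dilated_self_attention(x, w_q, w_k, w_v, dilation=2, window=4):
    """x: (T, D) frame embeddings; w_q/w_k/w_v: (D, D) projections; returns (T, D)."""
    T, D = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v                       # (T, D) each
    offsets = torch.arange(-window, window + 1) * dilation     # 2*window+1 dilated offsets
    idx = torch.arange(T).unsqueeze(1) + offsets.unsqueeze(0)  # (T, 2w+1) neighbour indices
    mask = (idx < 0) | (idx >= T)                              # mark out-of-range neighbours
    idx = idx.clamp(0, T - 1)
    k_nb, v_nb = k[idx], v[idx]                                # (T, 2w+1, D)
    scores = (k_nb * q.unsqueeze(1)).sum(-1) / D ** 0.5        # (T, 2w+1) dot-product scores
    scores = scores.masked_fill(mask, float('-inf'))           # ignore padded positions
    attn = F.softmax(scores, dim=-1)
    return (attn.unsqueeze(-1) * v_nb).sum(1)                  # (T, D) attended output

# Stacking such layers with dilations 1, 2, 4, ... grows the receptive field
# exponentially while keeping each layer at O(T * window) attention cost.
```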