The Transformer is a successful deep neural network (DNN) architecture that has proven versatile not only in natural language processing but also in music information retrieval (MIR). In this paper, we present a novel Transformer-based approach to beat and downbeat tracking. The approach employs SpecTNT (Spectral-Temporal Transformer in Transformer), a Transformer variant that models both the spectral and temporal dimensions of a time-frequency representation of music audio. A SpecTNT model uses a stack of blocks, each consisting of two levels of Transformer encoders. The lower-level (spectral) encoder handles the spectral features and enables the model to attend to the harmonic components of each frame. Since downbeats indicate bar boundaries and are often accompanied by harmonic changes, this step may help downbeat modeling. The upper-level (temporal) encoder aggregates useful local spectral information to attend to beat and downbeat positions. We also propose an architecture that combines SpecTNT with a state-of-the-art model, the Temporal Convolutional Network (TCN), to further improve performance. Extensive experiments demonstrate that our approach can significantly outperform TCN in downbeat tracking while maintaining comparable results in beat tracking.
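The two-level block described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes single-head attention with identity projections (no learned weights), mean pooling over frequency as the aggregation step, and a hypothetical per-frame sigmoid head. All function names (`spectnt_like_block`, `frame_activations`) and the toy dimensions are made up for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the tokens themselves
    (identity projections), so the sketch has no learned parameters.
    """
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        attended.append([sum(w * tok[i] for w, tok in zip(weights, tokens))
                         for i in range(dim)])
    return attended


def residual(tokens, attended):
    """Add the attention output back onto its input."""
    return [[t + a for t, a in zip(tok, att)]
            for tok, att in zip(tokens, attended)]


def spectnt_like_block(x):
    """x has shape [T][F][D]: T frames, F frequency tokens, D features."""
    # Lower level (spectral): attention across frequency tokens within each frame.
    spectral = [residual(frame, self_attention(frame)) for frame in x]
    # Aggregate each frame into one vector (assumption: mean over frequency).
    dim = len(x[0][0])
    summaries = [[sum(tok[i] for tok in frame) / len(frame) for i in range(dim)]
                 for frame in spectral]
    # Upper level (temporal): attention across the per-frame summaries.
    temporal = residual(summaries, self_attention(summaries))
    return spectral, temporal


def frame_activations(temporal, w_beat, w_downbeat):
    """Hypothetical output head: one (beat, downbeat) probability pair per frame."""
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return [(sigmoid(dot(vec, w_beat)), sigmoid(dot(vec, w_downbeat)))
            for vec in temporal]


# Toy input: 8 frames, 4 frequency tokens per frame, 6 features per token.
random.seed(0)
T, F, D = 8, 4, 6
x = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(F)] for _ in range(T)]
spectral, temporal = spectnt_like_block(x)
w_beat = [random.gauss(0, 0.1) for _ in range(D)]
w_downbeat = [random.gauss(0, 0.1) for _ in range(D)]
activations = frame_activations(temporal, w_beat, w_downbeat)
print(len(activations), "frames, first frame:", activations[0])
```

A trained model would use learned projections, multiple heads, and feed-forward layers, and would stack several such blocks; the sketch only shows the data flow from per-frame spectral attention to cross-frame temporal attention.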