In this work, we aim to improve the expressive capacity of waveform-based discriminative music networks by modeling both sequential (temporal) and hierarchical information in an efficient end-to-end architecture. We present MuSLCAT, or Multi-scale and Multi-level Convolutional Attention Transformer, a novel architecture for learning robust representations of complex music tags directly from raw waveform recordings. We also introduce a lightweight variant of MuSLCAT called MuSLCAN, short for Multi-scale and Multi-level Convolutional Attention Network. Both MuSLCAT and MuSLCAN model features at multiple scales and levels through an integrated frontend-backend architecture. The frontend targets different frequency ranges while modeling long-range dependencies and multi-level interactions using two convolutional attention networks built from attention-augmented convolution (AAC) blocks. The backend dynamically recalibrates the multi-scale and multi-level features extracted by the frontend by incorporating self-attention. The two architectures differ only in their backend components: MuSLCAT's backend is a modified version of BERT, while MuSLCAN's is a simple AAC block. We validate the proposed MuSLCAT and MuSLCAN architectures by comparing them to state-of-the-art networks on four benchmark datasets for music tagging and genre recognition. Our experiments show that MuSLCAT and MuSLCAN consistently yield competitive results compared to state-of-the-art waveform-based models while requiring considerably fewer parameters.
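To make the AAC idea concrete, the following is a minimal NumPy sketch of a single attention-augmented 1-D convolution step: a standard convolution captures local structure while a single-head self-attention pass captures long-range context, and the two feature maps are concatenated along the channel axis. All function and parameter names here are illustrative assumptions, not the paper's actual implementation, which operates on raw waveforms with learned multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aac_block(x, conv_w, wq, wk, wv):
    """Hypothetical attention-augmented 1-D convolution sketch.

    x:      (T, C_in) input feature sequence.
    conv_w: (k, C_in, C_conv) convolution kernel (odd k, 'same' padding).
    wq, wk: (C_in, d) query/key projections.
    wv:     (C_in, C_attn) value projection.
    Returns (T, C_conv + C_attn): local conv features concatenated
    with global self-attention features.
    """
    k = conv_w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    T = x.shape[0]
    # local branch: plain 1-D convolution over the padded sequence
    conv = np.stack([
        np.tensordot(xp[t:t + k], conv_w, axes=([0, 1], [0, 1]))
        for t in range(T)
    ])
    # global branch: single-head scaled dot-product self-attention
    q, key, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ key.T / np.sqrt(q.shape[-1])) @ v
    # fuse local and global features channel-wise
    return np.concatenate([conv, attn], axis=-1)
```

Stacking such blocks lets each layer mix local spectral detail with sequence-wide dependencies, which is the motivation for using AAC blocks in both the frontend and MuSLCAN's backend.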