We present the Music Tagging Transformer, trained with a semi-supervised approach. The proposed model captures local acoustic characteristics in shallow convolutional layers and then temporally summarizes the sequence of extracted features with stacked self-attention layers. Through careful model assessment, we first show that the proposed architecture outperforms previous state-of-the-art music tagging models based on convolutional neural networks under a supervised scheme. The Music Tagging Transformer is further improved by noisy student training, a semi-supervised approach that leverages both labeled and unlabeled data in combination with data augmentation. To the best of our knowledge, this is the first attempt to utilize the entire audio of the Million Song Dataset.
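A minimal PyTorch sketch of the front-end/back-end split described above, assuming a log-mel-spectrogram input: shallow convolutions extract local time-frequency features, and a stack of self-attention layers summarizes them over time. The layer counts, channel widths, tag vocabulary size, and the [CLS]-style pooling token are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MusicTaggingTransformer(nn.Module):
    """Shallow CNN front end + Transformer back end for music tagging.

    Hyperparameters are placeholders chosen for illustration.
    """

    def __init__(self, n_mels=128, d_model=256, n_layers=4, n_tags=50):
        super().__init__()
        # Shallow convolutional front end: captures local acoustic
        # patterns and downsamples time and frequency by 4x.
        self.frontend = nn.Sequential(
            nn.Conv2d(1, d_model // 4, kernel_size=3, padding=1),
            nn.BatchNorm2d(d_model // 4),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(d_model // 4, d_model // 2, kernel_size=3, padding=1),
            nn.BatchNorm2d(d_model // 2),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        # Collapse the frequency axis so each time frame becomes one token.
        self.to_tokens = nn.Linear((d_model // 2) * (n_mels // 4), d_model)
        # Stacked self-attention layers summarize the token sequence.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # A learnable [CLS]-style token pools the sequence for tagging.
        self.cls = nn.Parameter(torch.randn(1, 1, d_model))
        self.head = nn.Linear(d_model, n_tags)

    def forward(self, spec):            # spec: (batch, 1, n_mels, time)
        x = self.frontend(spec)         # (batch, C, n_mels/4, time/4)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # one token per frame
        x = self.to_tokens(x)           # (batch, time/4, d_model)
        x = torch.cat([self.cls.expand(b, -1, -1), x], dim=1)
        x = self.encoder(x)
        # Sigmoid outputs: multi-label tag probabilities from the CLS token.
        return torch.sigmoid(self.head(x[:, 0]))
```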
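For the semi-supervised stage, a hedged sketch of one noisy student update as commonly formulated: a teacher pseudo-labels unlabeled audio, and a student is trained on noised versions of both labeled and pseudo-labeled batches. Here `augment` stands in for an unspecified data-augmentation policy, and the use of soft pseudo-labels with an unweighted sum of the two losses is an assumption, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def noisy_student_step(teacher, student, labeled, unlabeled,
                       augment, optimizer):
    """One noisy-student update for multi-label tagging (sketch).

    `augment` is a placeholder for any input-noising policy; the loss
    weighting and soft pseudo-labels are illustrative assumptions.
    """
    x_l, y_l = labeled   # labeled batch: inputs and binary tag targets
    x_u = unlabeled      # unlabeled batch: inputs only

    # 1. The teacher pseudo-labels the unlabeled batch (no noise, no grad).
    teacher.eval()
    with torch.no_grad():
        pseudo = teacher(x_u)          # soft tag probabilities in [0, 1]

    # 2. The student sees augmented (noised) inputs for both batches.
    student.train()
    pred_l = student(augment(x_l))
    pred_u = student(augment(x_u))

    # 3. Binary cross-entropy on ground-truth labels plus pseudo-labels.
    loss = (F.binary_cross_entropy(pred_l, y_l)
            + F.binary_cross_entropy(pred_u, pseudo))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After training, the student can be promoted to teacher and the process repeated, which is the usual iterated form of noisy student training.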