In music, short-term features such as pitch and tempo combine to form long-term semantic features such as melody and narrative. A music genre classification (MGC) system should be able to analyze both kinds of features. In this research, we propose a novel framework that hierarchically extracts and aggregates both short- and long-term features. Our framework is based on ECAPA-TDNN, in which every layer that extracts short-term features is affected by the layers that extract long-term features through back-propagation during training. To prevent this distortion of short-term features, we devised a convolution channel separation technique that separates short-term features from the long-term feature extraction path. To extract more diverse features, we incorporated a frequency sub-bands aggregation method, which divides the input spectrogram along the frequency axis and processes each sub-band separately. We evaluated our framework on the Melon Playlist dataset, a large-scale dataset containing 600 times more data than GTZAN, the dataset most widely used in MGC studies. As a result, our framework achieved 70.4% accuracy, a 16.9% improvement over a conventional framework.
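To make the frequency sub-bands aggregation method concrete, the sketch below splits a mel-spectrogram along the frequency axis and encodes each band independently before merging the embeddings. This is a minimal illustration only: the class name `SubBandAggregator`, the band count, and the lightweight per-band conv encoders are assumptions standing in for the paper's actual ECAPA-TDNN-based branches.

```python
import torch
import torch.nn as nn

class SubBandAggregator(nn.Module):
    """Illustrative sketch of frequency sub-band aggregation: the input
    spectrogram is divided along the frequency (mel-bin) axis and each
    sub-band is processed by its own encoder before aggregation.
    Hyperparameters here are hypothetical, not the paper's."""

    def __init__(self, n_mels: int = 80, n_bands: int = 4, channels: int = 64):
        super().__init__()
        assert n_mels % n_bands == 0
        self.n_bands = n_bands
        band_mels = n_mels // n_bands
        # One small 1-D conv encoder per sub-band (a stand-in for the
        # per-band feature extraction branches described in the paper).
        self.band_encoders = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(band_mels, channels, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),  # pool over the time axis
            )
            for _ in range(n_bands)
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, n_mels, time)
        bands = torch.chunk(spec, self.n_bands, dim=1)  # split frequency axis
        pooled = [enc(b).squeeze(-1) for enc, b in zip(self.band_encoders, bands)]
        return torch.cat(pooled, dim=1)  # (batch, n_bands * channels)

# Usage: an 80-mel spectrogram of 300 frames -> one aggregated embedding.
emb = SubBandAggregator()(torch.randn(2, 80, 300))
print(emb.shape)  # torch.Size([2, 256])
```

Processing each sub-band with its own encoder lets the model learn band-specific patterns (e.g., bass versus treble content), which is the stated motivation for extracting more diverse features.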