Recently, massive architectures based on Convolutional Neural Networks (CNNs) and self-attention mechanisms have become the norm for audio classification. While these techniques are state-of-the-art, their effectiveness comes only with huge computational costs and parameter counts, large amounts of data augmentation, transfer from large datasets, and other tricks. By exploiting the lightweight nature of audio, we propose an efficient network structure called the Paired Inverse Pyramid Structure (PIP) and a network called the Paired Inverse Pyramid Structure MLP Network (PIPMN). PIPMN reaches 96\% accuracy on Environmental Sound Classification (ESC) with the UrbanSound8K dataset and 93.2\% on Music Genre Classification (MGC) with the GTZAN dataset, using only 1 million parameters. Both results are achieved without data augmentation or transfer learning. The code is publicly available at: https://github.com/JNAIC/PIPMN