Recently, massive architectures based on Convolutional Neural Networks (CNNs) and self-attention mechanisms have become standard for audio classification. While these techniques achieve state-of-the-art results, their effectiveness comes at the cost of huge computational budgets and parameter counts, heavy data augmentation, transfer from large datasets, and other tricks. Exploiting the lightweight nature of audio, we propose an efficient network structure called the Paired Inverse Pyramid Structure (PIP) and a network built on it, the Paired Inverse Pyramid Structure MLP Network (PIPMN). With only 1 million parameters, PIPMN reaches 96\% accuracy on Environmental Sound Classification (ESC) on the UrbanSound8K dataset and 93.2\% on Music Genre Classification (MGC) on the GTZAN dataset. Both results are achieved without data augmentation or model transfer. Code is publicly available at: https://github.com/JNAIC/PIPMN