In audio classification, differentiable auditory filterbanks with few parameters cover the middle ground between hard-coded spectrograms and raw audio. LEAF (arXiv:2101.08596), a Gabor-based filterbank combined with Per-Channel Energy Normalization (PCEN), has shown promising results, but it is computationally expensive. By using inhomogeneous convolution kernel sizes and strides, and by replacing PCEN with operations that parallelize better, we can reach similar results more efficiently. In experiments on six audio classification tasks, our frontend matches the accuracy of LEAF at 3% of the cost, but both fail to consistently outperform a fixed mel filterbank. The quest for learnable audio frontends is not solved.
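
To illustrate why PCEN is hard to parallelize, here is a minimal NumPy sketch of the standard PCEN formulation, not the implementation used in this work or in LEAF. The function name `pcen` and the default values for `s`, `alpha`, `delta`, `r` and `eps` are illustrative assumptions. The point is the first-order recursive smoother: each frame depends on the previous one, so the time axis has to be processed sequentially.

```python
import numpy as np

def pcen(energy, s=0.04, alpha=0.96, delta=2.0, r=0.5, eps=1e-12):
    """Standard PCEN on a (channels, frames) array of non-negative energies.

    All default parameter values are illustrative assumptions.
    """
    energy = np.asarray(energy, dtype=np.float64)
    smoothed = np.empty_like(energy)
    smoothed[:, 0] = energy[:, 0]  # initialise the smoother with the first frame
    # First-order IIR low-pass along time: frame t needs frame t-1,
    # which is the sequential dependency that limits parallelism.
    for t in range(1, energy.shape[1]):
        smoothed[:, t] = (1.0 - s) * smoothed[:, t - 1] + s * energy[:, t]
    # Adaptive gain control followed by root compression.
    return (energy / (eps + smoothed) ** alpha + delta) ** r - delta ** r
```

Calling `pcen(np.abs(x) ** 2)` on a filterbank output `x` of shape `(channels, frames)` returns an array of the same shape. Only the loop over `t` is inherently sequential; the gain and compression steps are element-wise.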