Mel-filterbanks are fixed, engineered audio features that emulate human perception and have been used throughout the history of audio understanding, up to the present day. However, their undeniable qualities are counterbalanced by the fundamental limitations of handmade representations. In this work we show that we can train a single learnable frontend that outperforms mel-filterbanks on a wide range of audio signals, including speech, music, audio events and animal sounds, providing a general-purpose learned frontend for audio classification. To do so, we introduce a new principled, lightweight, fully learnable architecture that can be used as a drop-in replacement for mel-filterbanks. Our system learns all operations of audio feature extraction, from filtering to pooling, compression and normalization, and can be integrated into any neural network at a negligible parameter cost. We perform multi-task training on eight diverse audio classification tasks, and show consistent improvements of our model over mel-filterbanks and previous learnable alternatives. Moreover, our system outperforms the current state-of-the-art learnable frontend on AudioSet, with orders of magnitude fewer parameters.
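The pipeline the abstract describes (filtering, then pooling, then compression and normalization, with every stage parameterized and hence trainable) can be sketched as a toy forward pass. This is a minimal illustration under assumed stage shapes, Gabor-like bandpass filters, strided energy pooling, and log compression with normalization; it is not the paper's exact architecture, and the filter parameters here are fixed rather than learned.

```python
import numpy as np

def learnable_frontend(waveform, n_filters=40, filter_len=401, stride=160):
    """Toy forward pass: each stage is parameterized, hence trainable in
    principle. Filter centers/widths, pooling stride and compression offset
    are illustrative assumptions, not the paper's learned values."""
    t = np.arange(filter_len) - filter_len // 2
    # 1. Filtering: a bank of Gabor-like bandpass filters, parameterized by
    #    center frequency and bandwidth (the quantities a frontend would learn).
    centers = np.linspace(0.01, 0.4, n_filters)   # normalized frequencies
    widths = np.full(n_filters, 0.01)
    envelopes = np.exp(-0.5 * (t[None, :] * widths[:, None]) ** 2)
    filters = envelopes * np.cos(2 * np.pi * centers[:, None] * t[None, :])
    filtered = np.stack(
        [np.convolve(waveform, f, mode="same") for f in filters]
    )
    # 2. Pooling: squared magnitude, then strided subsampling to frame rate.
    energy = filtered ** 2
    pooled = energy[:, ::stride]
    # 3. Compression + normalization: per-utterance log compression with a
    #    small offset (a stand-in for learnable, e.g. PCEN-style, compression).
    features = np.log(pooled + 1e-6)
    features = (features - features.mean()) / (features.std() + 1e-8)
    return features

rng = np.random.default_rng(0)
wav = rng.standard_normal(16000)        # 1 s of noise at 16 kHz
feats = learnable_frontend(wav)
print(feats.shape)                      # (n_filters, n_frames) = (40, 100)
```

In a real learnable frontend, `centers`, `widths`, the pooling window and the compression parameters would be tensors updated by backpropagation together with the downstream classifier, which is what distinguishes this design from a fixed mel-filterbank.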