Machine learning techniques have proved useful for classifying and analyzing audio content. However, recent methods typically rely on abstract and high-dimensional representations that are difficult to interpret. Inspired by transformation-invariant approaches developed for image and 3D data, we propose an audio identification model based on learnable spectral prototypes. Equipped with dedicated transformation networks, these prototypes can be used to cluster and classify input audio samples from large collections of sounds. Our model can be trained with or without supervision and reaches state-of-the-art results for speaker and instrument identification, while remaining easily interpretable. The code is available at: https://github.com/romainloiseau/a-model-you-can-hear
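To make the abstract's core idea concrete, below is a minimal PyTorch sketch of a prototype-and-transformation model: each class owns a learnable spectral template, and a small transformation network adapts every template to the input, which is then assigned to the best-reconstructing prototype. The class name, the affine transformation family, and all tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpectralPrototypeClassifier(nn.Module):
    """Hypothetical sketch: one learnable log-mel prototype per class.

    A transformation network predicts a per-input, per-prototype affine
    modulation; a sample is assigned to the class whose transformed
    prototype reconstructs it best.
    """

    def __init__(self, n_classes: int, n_mels: int = 64, n_frames: int = 128):
        super().__init__()
        # One learnable spectrogram template per class (assumed shapes).
        self.prototypes = nn.Parameter(torch.randn(n_classes, n_mels, n_frames))
        # Encoder pooling the input spectrogram into a feature vector.
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_mels * n_frames, 256),
            nn.ReLU(),
        )
        # Transformation heads: per-prototype gain and bias.
        self.to_gain = nn.Linear(256, n_classes)
        self.to_bias = nn.Linear(256, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, n_mels, n_frames) log-mel spectrograms.
        h = self.encoder(spec)                    # (batch, 256)
        gain = self.to_gain(h)[:, :, None, None]  # (batch, K, 1, 1)
        bias = self.to_bias(h)[:, :, None, None]  # (batch, K, 1, 1)
        # Transform every prototype for this input: (batch, K, n_mels, n_frames).
        transformed = gain * self.prototypes + bias
        # Per-class reconstruction error of the transformed prototypes.
        return ((transformed - spec[:, None]) ** 2).mean(dim=(2, 3))

model = SpectralPrototypeClassifier(n_classes=10)
errors = model(torch.randn(4, 64, 128))  # (4, 10) reconstruction errors
predictions = errors.argmin(dim=1)       # nearest transformed prototype
```

Under these assumptions, supervised training could minimize a cross-entropy loss over negative reconstruction errors, while unsupervised clustering could instead minimize each sample's smallest error, so every input is explained by its best-fitting prototype.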