Standard machine learning models for tagging and classifying acoustic signals cannot handle classes that were not seen during training. Zero-Shot (ZS) learning overcomes this restriction by predicting classes based on adaptable class descriptions. This study investigates the effectiveness of self-attention-based audio embedding architectures for ZS learning. To this end, we compare the recent patchout spectrogram transformer with two classic convolutional architectures. We evaluate the three architectures on three tasks and three benchmark datasets: general-purpose tagging on AudioSet, environmental sound classification on ESC-50, and instrument tagging on OpenMIC. Our results show that the self-attention-based embedding method outperforms both convolutional architectures in all of these settings. By designing training and test data accordingly, we observe that prediction performance suffers significantly when the `semantic distance' between training and test classes is large, an effect that deserves more detailed investigation.
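To make the ZS setup concrete, below is a minimal sketch of the prediction step: an audio embedding (as produced, e.g., by a transformer or CNN encoder) is projected into the semantic space of class-description embeddings, and the nearest unseen class wins. All names, dimensionalities, and the random placeholder vectors are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings: in practice these would come from an audio
# encoder and from word embeddings of the (unseen) class descriptions.
audio_emb = rng.normal(size=(768,))      # one clip's audio embedding (assumed dim)
class_embs = rng.normal(size=(5, 300))   # embeddings of 5 unseen class descriptions

# Hypothetical learned projection from audio space into semantic space;
# in a ZS setting it is trained on seen classes only.
W = rng.normal(size=(300, 768))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

projected = W @ audio_emb                # map audio embedding to semantic space
scores = np.array([cosine(projected, c) for c in class_embs])
predicted_class = int(np.argmax(scores)) # class whose description is closest
print(predicted_class, scores)
```

Because prediction reduces to nearest-neighbor search over class-description embeddings, new classes can be added at test time simply by embedding their descriptions, which is what makes the `semantic distance' between training and test classes a natural quantity to study.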