Supervised learning methods can solve a given problem when a large set of labeled data is available. However, acquiring a dataset that covers all target classes typically requires manual labeling, which is expensive and time-consuming. Zero-shot learning models can classify unseen concepts by exploiting their semantic information. The present study introduces image embeddings as side information for zero-shot audio classification using a nonlinear acoustic-semantic projection. We extract semantic image representations from the Open Images dataset and evaluate the performance of the models on an audio subset of AudioSet using semantic information from different domains: image, audio, and textual. We demonstrate that image embeddings can serve as semantic information for zero-shot audio classification. The experimental results show that image and textual embeddings yield similar performance, both individually and in combination. We additionally compute semantic acoustic embeddings from the test samples to provide an upper bound on performance. The results show that classification performance is highly sensitive to the semantic relation between test and training classes, and that textual and image embeddings can approach the performance of semantic acoustic embeddings when the seen and unseen classes are semantically similar.
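To make the described approach concrete, the following minimal sketch shows one way a nonlinear acoustic-semantic projection and nearest-neighbor zero-shot classification could be implemented. The network architecture, embedding dimensions, and function names here are illustrative assumptions, not the paper's exact configuration; in practice the projection would be trained on seen classes and then applied to audio from unseen classes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticSemanticProjection(nn.Module):
    """Nonlinear projection from the audio embedding space to the
    semantic (image/text) embedding space. Layer sizes are assumed
    for illustration, not taken from the paper."""
    def __init__(self, audio_dim=128, semantic_dim=512, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, semantic_dim),
        )

    def forward(self, audio_emb):
        return self.net(audio_emb)

def zero_shot_classify(audio_emb, class_semantic_embs, model):
    """Assign each audio clip to the unseen class whose semantic
    embedding (e.g., image or textual) is closest in cosine
    similarity to the projected audio embedding."""
    projected = F.normalize(model(audio_emb), dim=-1)       # (B, D)
    prototypes = F.normalize(class_semantic_embs, dim=-1)   # (C, D)
    scores = projected @ prototypes.T                       # (B, C)
    return scores.argmax(dim=-1)
```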