Music information is often conveyed or recorded across multiple data modalities, including but not limited to audio, images, text and scores. However, music information retrieval research has focused almost exclusively on single-modality recognition, which requires developing a separate model for each modality. Some multi-modal works require multiple coexisting modalities as model inputs, which restricts these models to the few cases where data from all modalities are available. To the best of our knowledge, no existing model can take inputs from varying modalities, e.g. images or sounds, and classify them into unified music categories. We explore the use of cross-modal retrieval as a pretext task to learn modality-agnostic representations, which can then be used as inputs to classifiers that are independent of modality. We select instrument classification as an example task for our study because both the visual and the audio components provide relevant semantic information. We train musical instrument classifiers that can take either images or sounds as input and that perform comparably to sound-only or image-only classifiers. Furthermore, we explore the case where labeled data for a given modality is limited, and the impact on performance of using labeled data from other modalities. In a zero-shot setting, we achieve almost 70% of the performance of the best-performing system. We provide a detailed analysis of the experimental results to understand the potential and limitations of the approach, and discuss future steps towards modality-agnostic classifiers.
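The zero-shot idea described in the abstract can be illustrated with a minimal sketch: a classifier is fitted on embeddings from one modality and then applied, unchanged, to embeddings from another. Everything here is an assumption made for illustration. The embeddings are synthetic toy vectors standing in for the outputs of encoders trained with a cross-modal retrieval objective, the function names are hypothetical, and a nearest-centroid classifier with cosine similarity is swapped in for simplicity. It is not the classifier or the data used in the paper.

```python
import numpy as np


def fit_centroids(embeddings, labels, num_classes):
    """Average the embeddings of each class and L2-normalize the result."""
    centroids = np.stack(
        [embeddings[labels == c].mean(axis=0) for c in range(num_classes)]
    )
    return centroids / np.linalg.norm(centroids, axis=1, keepdims=True)


def predict(centroids, embeddings):
    """Assign each embedding to the class whose centroid is most cosine-similar."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return np.argmax(normed @ centroids.T, axis=1)


def make_toy_embeddings(rng, prototypes, labels, noise):
    """Hypothetical stand-in for a trained encoder: class prototype plus noise."""
    jitter = noise * rng.standard_normal((len(labels), prototypes.shape[1]))
    return prototypes[labels] + jitter


rng = np.random.default_rng(0)
num_classes, dim = 4, 16

# Assumption: both modalities are already mapped into one shared space, so
# audio and image embeddings of the same instrument cluster together.
prototypes = rng.standard_normal((num_classes, dim))
audio_labels = np.arange(200) % num_classes
image_labels = np.arange(200) % num_classes
audio_emb = make_toy_embeddings(rng, prototypes, audio_labels, noise=0.3)
image_emb = make_toy_embeddings(rng, prototypes, image_labels, noise=0.3)

# Fit on labeled audio only, then classify images without any image labels.
centroids = fit_centroids(audio_emb, audio_labels, num_classes)
image_pred = predict(centroids, image_emb)
zero_shot_accuracy = float((image_pred == image_labels).mean())
print(f"zero-shot accuracy on images: {zero_shot_accuracy:.2f}")
```

The toy data is constructed so that the two modalities share class structure by design, so the accuracy printed here says nothing about real audio or images. With real encoders, how well the classifier transfers would depend on how closely the pretext task aligns the two modalities.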