This paper introduces effective design choices for text-to-music retrieval systems. An ideal text-based retrieval system would support various input queries, such as pre-defined tags, unseen tags, and sentence-level descriptions. In practice, most previous work has focused on a single query type (tag or sentence), which may not generalize to the other input type. Hence, we review recent text-based music retrieval systems using our proposed benchmark along two main aspects: input text representation and training objectives. Our findings enable a universal text-to-music retrieval system that achieves comparable retrieval performance on both tag- and sentence-level inputs. Furthermore, the proposed multimodal representation generalizes to 9 different downstream music classification tasks. The code and demo are available online.
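As a rough illustration of what retrieval with either query type could look like, the sketch below ranks tracks by cosine similarity to a query embedding. It assumes that text queries and audio tracks are embedded into one shared vector space, which the abstract does not spell out. The function names, track names, and toy vectors are all hypothetical and are not taken from the paper's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_u * norm_v)


def rank_tracks(query_vec, track_vecs):
    """Return (track_name, score) pairs sorted from best to worst match."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in track_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy audio embeddings (assumed to come from some audio encoder).
tracks = {
    "track_a": [0.9, 0.1, 0.0],
    "track_b": [0.1, 0.9, 0.0],
    "track_c": [0.5, 0.5, 0.1],
}

# Toy text embeddings (assumed to come from some text encoder):
# one standing in for a tag query, one for a sentence-level query.
tag_query = [1.0, 0.0, 0.0]
sentence_query = [0.2, 0.8, 0.0]

tag_ranking = [name for name, _ in rank_tracks(tag_query, tracks)]
sentence_ranking = [name for name, _ in rank_tracks(sentence_query, tracks)]
```

Under this assumption, tag and sentence queries go through the same ranking step; only the text encoder that produces the query vector would need to handle both input types.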