In this paper, we introduce Jointist, an instrument-aware multi-instrument framework capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions two other modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that uses the instrument information and the transcription results. The instrument conditioning is designed to provide explicit multi-instrument functionality, while the connection between the transcription and source separation modules is intended to improve transcription performance. This challenging problem formulation makes the model highly useful in practice, given that modern popular music typically consists of multiple instruments. However, its novelty necessitates a new perspective on how to evaluate such a model. In our experiments, we assess the model from various aspects, providing a new evaluation perspective for multi-instrument transcription. We also argue that transcription models can be used as a preprocessing module for other music analysis tasks. In experiments on several downstream tasks, the symbolic representation provided by our transcription model proved to be a helpful complement to spectrograms for downbeat detection, chord recognition, and key estimation.
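The data flow described above can be illustrated with a minimal sketch. It only shows how the three modules are wired together: instrument recognition runs first, and its output conditions transcription and source separation, with separation also receiving the piano roll. All names here (`run_pipeline`, `recognize`, `transcribe`, `separate`, and the toy stand-ins) are hypothetical and are not taken from the Jointist implementation; the toy functions return fixed values in place of trained models.

```python
from typing import Callable, Dict, List

# Hypothetical type aliases for illustration; not from the paper.
Audio = List[float]          # mono waveform samples
PianoRoll = List[List[int]]  # frames x pitches, binary note activations


def run_pipeline(
    audio: Audio,
    recognize: Callable[[Audio], List[str]],
    transcribe: Callable[[Audio, str], PianoRoll],
    separate: Callable[[Audio, str, PianoRoll], Audio],
) -> Dict[str, Dict[str, object]]:
    """Wire three placeholder modules together in the order the abstract describes."""
    results: Dict[str, Dict[str, object]] = {}
    # Step 1: instrument recognition decides which instruments to condition on.
    for instrument in recognize(audio):
        # Step 2: transcription is conditioned on one instrument at a time.
        roll = transcribe(audio, instrument)
        # Step 3: separation uses both the instrument label and the piano roll.
        stem = separate(audio, instrument, roll)
        results[instrument] = {"piano_roll": roll, "stem": stem}
    return results


# Toy stand-ins (assumptions) so the sketch runs without any trained model.
def toy_recognize(audio: Audio) -> List[str]:
    return ["piano", "bass"]


def toy_transcribe(audio: Audio, instrument: str) -> PianoRoll:
    return [[0] * 128 for _ in range(4)]  # 4 silent frames, 128 pitches


def toy_separate(audio: Audio, instrument: str, roll: PianoRoll) -> Audio:
    return [0.5 * x for x in audio]  # placeholder: just attenuates the mix


out = run_pipeline([0.0, 1.0, -1.0], toy_recognize, toy_transcribe, toy_separate)
print(sorted(out))
```

Passing the modules in as callables keeps the sketch agnostic about how each one is actually implemented; in a real system each would presumably be a neural network, and the way the separation module feeds back into transcription quality is not modeled here.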