Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint Training

In this paper, we introduce Jointist, an instrument-aware multi-instrument framework capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that uses the instrument information and the transcription results. Joint training of the transcription and source separation modules improves the performance of both tasks. The instrument module is optional and can be directly controlled by human users, which makes Jointist a flexible, user-controllable framework. This challenging problem formulation makes the model highly useful in the real world, since modern popular music typically consists of multiple instruments. Its novelty, however, necessitates a new perspective on how to evaluate such a model. In our experiments, we assess the proposed model from various aspects, providing a new evaluation perspective for multi-instrument transcription. Our subjective listening study shows that Jointist achieves state-of-the-art performance on popular music, outperforming existing multi-instrument transcription models such as MT3. We also conducted experiments on several downstream tasks and found that the proposed method improved transcription by more than 1 percentage point (ppt.), source separation by 5 SDR, downbeat detection by 1.8 ppt., chord recognition by 1.4 ppt., and key estimation by 1.4 ppt. when utilizing transcription results obtained from Jointist. A demo is available at https://jointist.github.io/Demo
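
The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: every function name, type alias, and default value below is hypothetical, and the three modules are replaced by stubs that return placeholder outputs. The sketch only shows how an optional instrument recognition step might condition the transcription step, and how the separation step might then consume both the instrument label and the resulting piano roll.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical type aliases, chosen for illustration only.
Audio = List[float]            # mono waveform samples
PianoRoll = List[List[int]]    # frames x pitches, binary activations


def recognize_instruments(audio: Audio) -> List[str]:
    """Stub recognizer; a real module would predict instruments from the audio."""
    return ["piano", "bass"]


def transcribe(audio: Audio, instrument: str,
               n_frames: int = 4, n_pitches: int = 88) -> PianoRoll:
    """Stub transcriber; returns an empty piano roll for the given instrument."""
    return [[0] * n_pitches for _ in range(n_frames)]


def separate(audio: Audio, instrument: str, roll: PianoRoll) -> Audio:
    """Stub separator conditioned on the instrument and its piano roll.

    Returns silence with the same length as the input mixture.
    """
    return [0.0] * len(audio)


def run_pipeline(
    audio: Audio, instruments: Optional[List[str]] = None
) -> Tuple[Dict[str, PianoRoll], Dict[str, Audio]]:
    # The recognition step is optional: a user may supply the instruments directly.
    if instruments is None:
        instruments = recognize_instruments(audio)
    # One instrument-specific piano roll per requested instrument.
    rolls = {name: transcribe(audio, name) for name in instruments}
    # Separation uses both the instrument label and the transcription result.
    stems = {name: separate(audio, name, rolls[name]) for name in instruments}
    return rolls, stems
```

Calling `run_pipeline(audio, instruments=["violin"])` bypasses the stub recognizer, which mirrors the user-controllable mode mentioned in the abstract. Calling `run_pipeline(audio)` falls back to the recognizer's predictions.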
