Automatic music transcription (AMT) aims to convert raw audio into a symbolic music representation. As a fundamental problem in music information retrieval (MIR), AMT is considered difficult even for trained human experts, due to the overlap of multiple harmonics in the acoustic signal. Speech recognition, one of the most popular tasks in natural language processing, aims to translate spoken human language into text. Given the similar nature of AMT and speech recognition (both translate an audio signal into a symbolic encoding), this paper investigates whether a generic neural network architecture can work well on both tasks. We introduce a new neural network architecture built on top of the current state-of-the-art Onsets and Frames model, and compare the performance of several of its variants on the AMT task. We also test our architecture on speech recognition. On AMT, our models outperform a model trained with the state-of-the-art architecture; on speech recognition, a similar architecture can be trained, but its results fall short of those of task-specific models.