Automatic Music Transcription (AMT) is a crucial technology in music information processing. Despite recent improvements in performance through machine learning approaches, existing methods often achieve high accuracy in domains with abundant annotation data, primarily due to the difficulty of creating annotation data. A practical transcription model requires an architecture that does not require an annotation data. In this paper, we propose an annotation-free transcription model achieved through the utilization of scalable synthetic audio for pre-training and adversarial domain confusion using unannotated real audio. Through evaluation experiments, we confirm that our proposed method can achieve higher accuracy under annotation-free conditions compared to when learning with mixture of annotated real audio data. Additionally, through ablation studies, we gain insights into the scalability of this approach and the challenges that lie ahead in the field of AMT research.
翻译:暂无翻译