Automatic Music Transcription (AMT) converts audio recordings into symbolic musical representations. Training deep neural networks (DNNs) for AMT typically requires strongly aligned training pairs with precise frame-level annotations. Since creating such datasets is costly and impractical for many musical contexts, weakly aligned approaches using segment-level annotations have gained traction. However, existing methods often rely on Dynamic Time Warping (DTW) or soft alignment loss functions, both of which still require local semantic correspondences, making them error-prone and computationally expensive. In this article, we introduce CountEM, a novel AMT framework that eliminates the need for explicit local alignment by leveraging note event histograms as supervision, enabling lighter computation and greater flexibility. Using an Expectation-Maximization (EM) approach, CountEM iteratively refines predictions based solely on note occurrence counts, significantly reducing annotation effort while maintaining high transcription accuracy. Experiments on piano, guitar, and multi-instrument datasets demonstrate that CountEM matches or surpasses existing weakly supervised methods, improving AMT's robustness, scalability, and efficiency. Our project page is available at https://yoni-yaffe.github.io/count-the-notes.
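To make the count-based EM idea concrete, the sketch below shows one plausible way such supervision could be wired up; it is an illustrative assumption, not the paper's implementation. In the E-step, the current model's predictions are combined with the annotated per-pitch note counts (the note event histogram) to form pseudo-labels by selecting, for each pitch, the k most confident frames, where k is the annotated count; in the M-step, the network is fit to these pseudo-labels with a standard frame-wise loss. The names `model`, `segment`, and `counts` and all tensor shapes are hypothetical.

```python
# Illustrative sketch only: one count-supervised EM step, not CountEM's actual code.
# Assumptions: `model` maps an audio segment to per-frame, per-pitch onset logits
# of shape (frames, pitches); `counts` is the annotated number of note events per
# pitch in the segment (the note event histogram).
import torch
import torch.nn.functional as F

def em_step(model, optimizer, segment, counts):
    # E-step: derive pseudo-labels from current predictions and the note counts.
    with torch.no_grad():
        probs = torch.sigmoid(model(segment))           # (frames, pitches)
        targets = torch.zeros_like(probs)
        for pitch, k in enumerate(counts):
            if k > 0:
                # Mark the k most confident frames for this pitch as positives.
                topk = torch.topk(probs[:, pitch], k=min(int(k), probs.shape[0]))
                targets[topk.indices, pitch] = 1.0

    # M-step: fit the network to the pseudo-labels with a frame-wise BCE loss.
    optimizer.zero_grad()
    logits = model(segment)
    loss = F.binary_cross_entropy_with_logits(logits, targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Iterating this step lets the pseudo-labels and the network refine each other using only segment-level note counts, with no frame-level alignment or DTW pass required.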