Many tasks in music information retrieval (MIR) involve weakly aligned data, where exact temporal correspondences are unknown. The connectionist temporal classification (CTC) loss is a standard technique to learn feature representations based on weakly aligned training data. However, CTC is limited to discrete-valued target sequences and can be difficult to extend to multi-label problems. In this article, we show how soft dynamic time warping (SoftDTW), a differentiable variant of classical DTW, can be used as an alternative to CTC. Using multi-pitch estimation as an example scenario, we show that SoftDTW yields results on par with a state-of-the-art multi-label extension of CTC. In addition to being more elegant in terms of its algorithmic formulation, SoftDTW naturally extends to real-valued target sequences.
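To make the idea of a differentiable DTW concrete, the following is a minimal sketch of the SoftDTW forward recursion in NumPy. It is not the implementation used in this article. The function names `soft_min` and `soft_dtw`, the squared Euclidean local cost, and the default `gamma` are assumptions chosen for illustration. The hard minimum of classical DTW is replaced by a smoothed minimum controlled by a temperature `gamma`; as `gamma` approaches zero, the value approaches the classical DTW cost.

```python
import numpy as np


def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    z = -np.asarray(values, dtype=float) / gamma
    z_max = np.max(z)
    if not np.isfinite(z_max):
        # All candidates are +inf (unreachable cell).
        return np.inf
    return -gamma * (z_max + np.log(np.sum(np.exp(z - z_max))))


def soft_dtw(x, y, gamma=1.0):
    """SoftDTW value between sequences x (N, D) and y (M, D).

    Assumption: squared Euclidean distance as the local cost.
    Only the forward pass is shown; gradients are not computed here.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = x.shape[0], y.shape[0]

    # Pairwise local costs, shape (N, M).
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)

    # Accumulated cost matrix with an extra border row and column.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = (acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + soft_min(prev, gamma)
    return acc[n, m]


# Real-valued sequences of different lengths (weakly aligned).
predictions = np.array([[0.0], [0.0], [1.0], [2.0]])
targets = np.array([[0.0], [1.0], [2.0]])
loss = soft_dtw(predictions, targets, gamma=0.01)
```

Because the recursion only requires a local cost between frames, the targets can be real-valued vectors of any dimension, such as multi-hot or continuous pitch activations. In a training setting, one would typically write the same recursion in an automatic differentiation framework so that the loss can be backpropagated to the network producing the predictions.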