Singing melody extraction is an important problem in the field of music information retrieval. Existing methods typically rely on frequency-domain representations to estimate the sung frequencies. However, this design falls short of human-level perception of melody information in both tone (pitch class) and octave. In this paper, we propose TONet, a plug-and-play model that improves both tone and octave perception by leveraging a novel input representation and a novel network architecture. First, we present an improved input representation, the Tone-CFP, which explicitly groups harmonics via a rearrangement of frequency bins. Second, we introduce an encoder-decoder architecture designed to obtain a salience feature map, a tone feature map, and an octave feature map. Third, we propose a tone-octave fusion mechanism to improve the final salience feature map. Experiments are conducted to verify the capability of TONet with various baseline backbone models. Our results show that tone-octave fusion with Tone-CFP can significantly improve singing melody extraction performance across various datasets, with substantial gains in octave and tone accuracy.
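To make the bin-rearrangement idea behind Tone-CFP concrete, the following is a minimal sketch, not the paper's implementation: it reorders an octave-major log-frequency CFP map into tone-major order, so that the bins of one pitch class across all octaves become adjacent. The bin counts (`BINS_PER_OCTAVE`, `NUM_OCTAVES`) and the function name `to_tone_major` are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of grouping harmonically related bins by pitch class.
# Assumed layout of the input CFP map: frequency-major, i.e. the row for
# octave o and tone t is o * BINS_PER_OCTAVE + t. These constants are
# assumptions for the sketch, not values from the paper.
BINS_PER_OCTAVE = 60   # e.g. 5 bins per semitone
NUM_OCTAVES = 6

def to_tone_major(cfp: np.ndarray) -> np.ndarray:
    """Reorder a (freq_bins, time) CFP map from octave-major to tone-major.

    Input rows:  [oct0-tone0, oct0-tone1, ..., oct1-tone0, ...]
    Output rows: [tone0-oct0, tone0-oct1, ..., tone1-oct0, ...]
    """
    n_bins, n_frames = cfp.shape
    assert n_bins == BINS_PER_OCTAVE * NUM_OCTAVES
    # (octave, tone, time) -> (tone, octave, time) -> flatten tone-major
    return (cfp.reshape(NUM_OCTAVES, BINS_PER_OCTAVE, n_frames)
               .transpose(1, 0, 2)
               .reshape(n_bins, n_frames))

# Usage: row t * NUM_OCTAVES + o of the output equals
# row o * BINS_PER_OCTAVE + t of the input.
cfp = np.random.rand(BINS_PER_OCTAVE * NUM_OCTAVES, 128)
tcfp = to_tone_major(cfp)
assert np.array_equal(tcfp[3 * NUM_OCTAVES + 2], cfp[2 * BINS_PER_OCTAVE + 3])
```

Under this assumed layout, a convolution over adjacent rows of the rearranged map sees all octave instances of a pitch class at once, which is one plausible reading of how the grouping aids tone perception.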