Timbre representations of musical instruments, which are essential for diverse applications such as musical audio synthesis and separation, can be learned as bottleneck features from an instrument recognition model. Given the similarities between speaker recognition and musical instrument recognition, in this paper we investigate how to adapt successful speaker recognition algorithms to musical instrument recognition in order to learn meaningful instrument timbre representations. To address the mismatch between musical audio and models devised for speech, we introduce a group of trainable filters that generate appropriate acoustic features from input raw waveforms, making the model easier to optimize in an input-agnostic, end-to-end manner. In experiments on both the NSynth and RWC databases, covering both closed-set musical instrument identification and open-set verification scenarios, the modified speaker recognition model generated discriminative embeddings for instrument and instrument-family identities. We further conducted extensive experiments to characterize the information encoded in the learned timbre embeddings.
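To make the front-end concrete, below is a minimal PyTorch sketch of one way such a trainable filter bank could operate directly on raw waveforms before a speaker-recognition-style backbone. The class name, filter count, filter length, and hop size are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TrainableFilterbank(nn.Module):
    """Learnable 1-D convolutional filter bank applied to raw waveforms.

    A sketch of the idea described in the abstract: a bank of trainable
    filters replaces fixed speech-oriented features (e.g., mel filter
    banks) so the front-end is optimized end to end with the recognition
    model. All hyperparameters here are assumed values for illustration.
    """

    def __init__(self, n_filters=40, filter_len=401, hop=160):
        super().__init__()
        # Each output channel is one trainable filter; the stride acts
        # as the analysis hop size.
        self.filters = nn.Conv1d(1, n_filters, kernel_size=filter_len,
                                 stride=hop, padding=filter_len // 2,
                                 bias=False)

    def forward(self, waveform):
        # waveform: (batch, samples) -> (batch, 1, samples)
        x = waveform.unsqueeze(1)
        x = self.filters(x)              # (batch, n_filters, frames)
        # Log-compressed magnitude, analogous to a log mel spectrogram.
        return torch.log(x.abs() + 1e-6)

# Usage: a batch of two 1-second clips at 16 kHz yields a
# (batch, n_filters, frames) feature map for the downstream model.
features = TrainableFilterbank()(torch.randn(2, 16000))
```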