In music and speech, meaning is derived at multiple levels of context. Affect, for example, can be inferred both from a short sound token and from sonic patterns over a longer temporal window, such as an entire recording. In this paper we focus on inferring meaning from this dichotomy of contexts. We show how contextual representations of short sung vocal lines can be implicitly learned from fundamental frequency ($F_0$) and thus serve as a meaningful feature space for downstream Music Information Retrieval (MIR) tasks. We propose three self-supervised deep learning paradigms that leverage pseudo-task learning of these two levels of context to produce latent representation spaces. We evaluate the usefulness of these representations by embedding unseen vocal contours into each space and conducting downstream classification tasks. Our results show that contextual representations can improve downstream classification by as much as 15% compared with traditional statistical contour features.
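As a rough illustration of the kind of pseudo-task training the abstract describes, the sketch below trains a small convolutional encoder on pairs of short $F_0$ segments, using whether the two segments come from the same recording as a free (self-supervised) label that links short-token context to recording-level context. The architecture, the specific pseudo-task, and all names here are illustrative assumptions, not the paper's actual paradigms.

```python
# Hypothetical sketch only: the paper's three paradigms are not specified
# in the abstract, so this stands in for a generic contour pseudo-task.
import torch
import torch.nn as nn

class ContourEncoder(nn.Module):
    """Embeds a short F0 contour (1 channel x T frames) into a latent vector."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # pool over time -> fixed length
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, f0):                    # f0: (batch, 1, frames)
        h = self.conv(f0).squeeze(-1)         # (batch, 64)
        return self.proj(h)                   # (batch, dim)

encoder = ContourEncoder()
head = nn.Linear(2 * 64, 1)                   # binary pseudo-task head

# Two short F0 segments per example; the pseudo-label says whether they
# were drawn from the same recording (long context) or different ones.
a = torch.randn(8, 1, 200)                    # placeholder F0 frames
b = torch.randn(8, 1, 200)
labels = torch.randint(0, 2, (8, 1)).float()

z = torch.cat([encoder(a), encoder(b)], dim=1)        # (8, 128)
loss = nn.functional.binary_cross_entropy_with_logits(head(z), labels)
loss.backward()
```

After training on such a pseudo-task, the head would be discarded and unseen vocal contours embedded via `encoder(...)` to serve as features for the downstream classification tasks the abstract evaluates.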