Self-supervised methods learn representations by solving pretext tasks that do not require human-generated labels, alleviating the need for time-consuming annotation. These methods have been applied in computer vision, natural language processing, environmental sound analysis, and recently in music information retrieval, e.g. for pitch estimation. In the context of music in particular, there are few insights into how fragile these models are under different data distributions, and how this fragility could be mitigated. In this paper, we explore these questions by dissecting a self-supervised pitch estimation model adapted for tempo estimation, via rigorous experimentation with synthetic data. Specifically, we study the relationship between the input representation and the data distribution for self-supervised tempo estimation.