Large-scale databases with high-quality manual annotations are scarce in the audio domain. We therefore explore a self-supervised graph approach to learning audio representations from highly limited labelled data. Treating each audio sample as a graph node, we propose a subgraph-based framework with novel self-supervision tasks that learns effective audio representations. During training, subgraphs are constructed by sampling from the entire pool of available training data, exploiting the relationships between labelled and unlabelled audio samples. During inference, we use random edges to alleviate the overhead of graph construction. We evaluate our model on three benchmark audio databases and two tasks: acoustic event detection and speech emotion recognition. Our semi-supervised model performs better than or on par with fully supervised models and outperforms several competitive existing models. Our model is compact (240k parameters) and produces generalized audio representations that are robust to different types of signal noise.
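As a rough illustration of the subgraph idea described above (a sketch, not the paper's exact procedure), the snippet below samples a small subgraph from a mixed pool of labelled and unlabelled audio-clip embeddings. At training time, edges could plausibly come from embedding similarity; at inference, random edges sidestep the pairwise-similarity cost, mirroring the abstract's random-edge shortcut. All names and parameters (`sample_subgraph`, `k`, `n_nodes`) are hypothetical.

```python
import numpy as np

def sample_subgraph(embeddings, n_nodes=32, k=4, random_edges=False, rng=None):
    """Sample a subgraph over a pool of audio-clip embeddings.

    embeddings: (N, D) array, one row per audio sample (graph node);
        the pool mixes labelled and unlabelled samples.
    random_edges: if True, use the inference-time random-edge shortcut.
    Returns the sampled node indices and an (n_nodes, n_nodes) adjacency matrix.
    """
    rng = rng or np.random.default_rng()
    # Draw nodes from the whole training pool (labelled + unlabelled).
    idx = rng.choice(len(embeddings), size=n_nodes, replace=False)
    sub = embeddings[idx]

    adj = np.zeros((n_nodes, n_nodes))
    if random_edges:
        # Inference: connect each node to k random neighbours,
        # skipping the O(n^2) similarity computation entirely.
        for i in range(n_nodes):
            candidates = np.delete(np.arange(n_nodes), i)  # no self-loops
            adj[i, rng.choice(candidates, size=k, replace=False)] = 1.0
    else:
        # Training: connect each node to its k nearest neighbours by
        # cosine similarity (one plausible edge-construction choice).
        normed = sub / np.linalg.norm(sub, axis=1, keepdims=True)
        sim = normed @ normed.T
        np.fill_diagonal(sim, -np.inf)  # exclude self-matches
        for i in range(n_nodes):
            adj[i, np.argsort(sim[i])[-k:]] = 1.0
    return idx, adj
```

A graph neural network could then be run over each sampled subgraph, with the self-supervision tasks defined on its nodes and edges.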