Event detection improves when events are captured by two different modalities rather than just one. However, training detection systems on multiple modalities is challenging, in particular when unlabeled data is abundant but labeled data is limited. We develop a novel self-supervised learning technique for multi-modal data that learns (hidden) correlations between simultaneously recorded microphone (sound) signals and accelerometer (body vibration) signals. The key objective of this work is to learn useful embeddings that yield high performance in downstream event detection tasks when labeled data is scarce and the audio events of interest (songbird vocalizations) are sparse. Our approach is based on deep canonical correlation analysis (DCCA), which suffers from event sparseness. We overcome the sparseness of positive labels by first learning a data sampling model from the labeled data and then applying DCCA to the output it produces. This method, which we term balanced DCCA (b-DCCA), improves the performance of the unsupervised embeddings on the downstream supervised audio detection task compared to classical DCCA. Because imbalanced data labels are common, our method may be of broad utility in low-resource scenarios.
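The core idea of b-DCCA described above is to rebalance the sparse positive frames before the correlation objective is applied. The sketch below illustrates this on toy data, using plain linear CCA on resampled batches as a stand-in for the two deep view-specific networks of DCCA; all function and variable names (balanced_batch, canonical_correlations, pos_frac, ...) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: rebalance sparse positive frames, then measure the
# canonical correlations that (D)CCA maximizes between the two views.
import numpy as np

def canonical_correlations(X, Y, reg=1e-4, k=4):
    """Top-k canonical correlations between views X and Y (rows = samples).
    This is the quantity DCCA maximizes through its view-specific networks."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)
    # Whiten each view via Cholesky factors; the singular values of the
    # whitened cross-covariance Lx^{-1} Sxy Ly^{-T} are the correlations.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    T = np.linalg.solve(Lx, np.linalg.solve(Ly, Sxy.T).T)
    return np.linalg.svd(T, compute_uv=False)[:k]

def balanced_batch(audio, vib, labels, batch_size=256, pos_frac=0.5, rng=None):
    """Resample frames so sparse positive (vocalization) frames make up
    roughly pos_frac of each batch before the correlation objective is used."""
    rng = np.random.default_rng() if rng is None else rng
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = int(batch_size * pos_frac)
    idx = np.concatenate([
        rng.choice(pos, size=n_pos, replace=True),           # oversample rare events
        rng.choice(neg, size=batch_size - n_pos, replace=False),
    ])
    return audio[idx], vib[idx]

# Toy demo: two correlated views (audio, vibration) with ~2% positive frames.
rng = np.random.default_rng(0)
labels = (rng.random(5000) < 0.02).astype(int)
latent = rng.normal(size=(5000, 4)) + 3.0 * labels[:, None]
audio = latent @ rng.normal(size=(4, 16)) + 0.5 * rng.normal(size=(5000, 16))
vib   = latent @ rng.normal(size=(4, 8))  + 0.5 * rng.normal(size=(5000, 8))

a_bal, v_bal = balanced_batch(audio, vib, labels, rng=rng)
print("balanced-batch canonical correlations:",
      np.round(canonical_correlations(a_bal, v_bal), 3))
```

In the full method, the resampled batches would be fed to the two deep networks trained with the DCCA objective rather than to linear CCA; the sketch only shows why rebalancing makes the event-related correlations visible to that objective.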