The lack of large labeled datasets remains a significant challenge in many application areas of deep learning. Researchers and practitioners typically resort to transfer learning and data augmentation to alleviate this issue. We study these strategies in the context of audio retrieval with natural language queries (Task 6b of the DCASE 2022 Challenge). Our proposed system uses pre-trained embedding models to project recordings and textual descriptions into a shared audio-caption space in which related examples from different modalities are close. We apply various data augmentation techniques to audio and text inputs and systematically tune their hyperparameters with sequential model-based optimization. Our results show that the augmentation strategies we use reduce overfitting and improve retrieval performance. We further show that pre-training the system on the AudioCaps dataset leads to additional improvements.
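To illustrate how retrieval in a shared audio-caption space can work, the following minimal sketch ranks recordings for each caption by cosine similarity and computes recall at k. It is not the authors' implementation: the function names (`l2_normalize`, `rank_audio`, `recall_at_k`) are hypothetical, and the embeddings are random placeholders standing in for the outputs of pre-trained audio and text encoders.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def rank_audio(caption_emb, audio_emb):
    """For each caption, return audio indices sorted by descending cosine similarity."""
    sims = l2_normalize(caption_emb) @ l2_normalize(audio_emb).T
    return np.argsort(-sims, axis=1)


def recall_at_k(ranking, target_idx, k):
    """Fraction of captions whose matching recording appears in the top k results."""
    hits = (ranking[:, :k] == target_idx[:, None]).any(axis=1)
    return float(hits.mean())


# Placeholder embeddings (assumption): in a real system these would come from
# pre-trained audio and text models projected into the shared space.
rng = np.random.default_rng(0)
audio_emb = rng.normal(size=(5, 8))
# Each caption embedding is a slightly perturbed copy of its recording's embedding.
caption_emb = audio_emb + 0.01 * rng.normal(size=(5, 8))
targets = np.arange(5)

ranking = rank_audio(caption_emb, audio_emb)
print("recall@1:", recall_at_k(ranking, targets, k=1))
```

In this toy setup, captions lie close to their recordings by construction; in practice, that closeness is what training the projection into the shared space is meant to achieve.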