We present an analysis of large-scale pretrained deep learning models used for cross-modal (text-to-audio) retrieval. We use embeddings extracted by these models in a metric learning framework to connect matching pairs of audio and text. Shallow neural networks map the embeddings to a common dimensionality. Our system, which is an extension of our submission to the Language-based Audio Retrieval Task of the DCASE Challenge 2022, employs the RoBERTa foundation model as the text embedding extractor. A pretrained PANNs model extracts the audio embeddings. To improve the generalisation of our model, we investigate how pretraining on audio and associated noisy text collected from the online platform Freesound improves the performance of our method. Furthermore, our ablation study reveals that the proper choice of the loss function and fine-tuning of the pretrained models are essential for training a competitive retrieval system.
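As a rough illustration of the dual-encoder setup described above, the sketch below projects precomputed text and audio embeddings into a shared space with shallow networks and scores matching pairs with a contrastive loss. The projection sizes (768 for RoBERTa, 2048 for PANNs), the two-layer projection design, the temperature, and the symmetric cross-entropy (NT-Xent-style) loss are illustrative assumptions, not the paper's exact configuration; the abstract notes that the loss choice is studied in the ablation.

```python
# Minimal sketch (not the authors' exact code): precomputed RoBERTa text embeddings and
# PANNs audio embeddings are mapped by shallow projection networks to a common
# dimensionality and trained with a contrastive loss over matching audio-text pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Projection(nn.Module):
    """Shallow network mapping a modality-specific embedding to the shared dimensionality."""

    def __init__(self, in_dim: int, out_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-norm outputs so the dot product below is a cosine similarity.
        return F.normalize(self.net(x), dim=-1)


# Assumed embedding sizes: 768 for RoBERTa-base text features, 2048 for PANNs audio features.
text_proj = Projection(in_dim=768)
audio_proj = Projection(in_dim=2048)


def contrastive_loss(text_emb: torch.Tensor, audio_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over the pairwise similarity matrix; diagonal entries are the matching pairs."""
    t = text_proj(text_emb)
    a = audio_proj(audio_emb)
    logits = t @ a.T / tau                # (batch, batch) similarity matrix
    targets = torch.arange(t.size(0))     # i-th caption matches i-th audio clip
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))


# Example with random stand-ins for a batch of 8 precomputed embeddings.
loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 2048))
loss.backward()
```

At retrieval time, the same projections would embed a query caption and all candidate audio clips, and the clips would be ranked by cosine similarity to the query.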