Audio-text retrieval aims at retrieving a target audio clip or caption from a pool of candidates given a query in the other modality. Solving such a cross-modal retrieval task is challenging because it requires not only learning robust feature representations for both modalities, but also capturing the fine-grained alignment between them. Existing cross-modal retrieval models are mostly optimized with metric learning objectives, which map data from both modalities into a shared embedding space where similar data are close together and dissimilar data are far apart. Unlike other cross-modal retrieval tasks such as image-text and video-text retrieval, audio-text retrieval remains largely unexplored. In this work, we study the impact of different metric learning objectives on the audio-text retrieval task. We present an extensive evaluation of popular metric learning objectives on the AudioCaps and Clotho datasets. We demonstrate that the NT-Xent loss, adapted from self-supervised learning, shows stable performance across different datasets and training settings, and outperforms the popular triplet-based losses. Our code is available at https://github.com/XinhaoMei/audio-text_retrieval.
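For context, the NT-Xent objective referenced above is a temperature-scaled cross-entropy over in-batch similarities. Below is a minimal PyTorch sketch of a symmetric NT-Xent loss for paired audio and text embeddings; the function name, default temperature, and tensor shapes are illustrative assumptions and are not taken from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric NT-Xent loss over a batch of paired embeddings.

    audio_emb, text_emb: (batch, dim) tensors; row i of each is a matched
    audio-caption pair. Matched pairs are positives; all other in-batch
    pairs serve as negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix: entry (i, j) compares audio i with text j.
    logits = audio_emb @ text_emb.t() / temperature

    # Diagonal entries correspond to the positive pairs.
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)

    # Cross-entropy in both retrieval directions (audio->text and text->audio).
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2t + loss_t2a) / 2
```

The symmetric form averages the two retrieval directions, which matches the task setup: the model is evaluated on both audio-to-text and text-to-audio retrieval.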