The objective of this work is cross-modal text-audio and audio-text retrieval, in which the goal is to retrieve, from a pool of candidates, the audio content that best matches a given written description, and vice versa. Text-audio retrieval enables users to search large databases through an intuitive interface: they simply issue free-form natural language descriptions of the sound they would like to hear. To study the tasks of text-audio and audio-text retrieval, which have received limited attention in the existing literature, we introduce three challenging new benchmarks. We first construct text-audio and audio-text retrieval benchmarks from the AudioCaps and Clotho audio captioning datasets. Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho. We employ these three benchmarks to establish baselines for cross-modal text-audio and audio-text retrieval, where we demonstrate the benefits of pre-training on diverse audio tasks. We hope that our benchmarks will inspire further research into audio retrieval with free-form text queries. Code, audio features for all datasets used, and the SoundDescs dataset are publicly available at https://github.com/akoepke/audio-retrieval-benchmark.