This work addresses cross-modal text-audio and audio-text retrieval, in which the goal is to retrieve, from a pool of candidates, the audio content that best matches a given written description, and vice versa. Text-audio retrieval enables users to search large databases through an intuitive interface: they simply issue free-form natural language descriptions of the sound they would like to hear. To study the tasks of text-audio and audio-text retrieval, which have received limited attention in the existing literature, we introduce three challenging new benchmarks. We first construct text-audio and audio-text retrieval benchmarks from the AudioCaps and Clotho audio captioning datasets. Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho. We employ these three benchmarks to establish baselines for cross-modal text-audio and audio-text retrieval, and demonstrate the benefits of pre-training on diverse audio tasks. We hope that our benchmarks will inspire further research into audio retrieval with free-form text queries. Code, audio features for all datasets used, and the SoundDescs dataset will be made publicly available.
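To make the retrieval task concrete, the following is a minimal sketch of how a text query could be matched against a pool of audio candidates. It assumes, purely for illustration, that text and audio have already been mapped to vectors in a shared embedding space and that matches are scored by cosine similarity. The toy embeddings, the function names, and the use of recall@k as a metric are assumptions of this sketch and are not taken from the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(query, candidates):
    """Return candidate indices ordered from best to worst match."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def recall_at_k(queries, candidates, ground_truth, k):
    """Fraction of queries whose correct candidate appears in the top k."""
    hits = sum(
        1
        for q, gt in zip(queries, ground_truth)
        if gt in rank_candidates(q, candidates)[:k]
    )
    return hits / len(queries)


# Hypothetical toy embeddings (assumed, not real model outputs).
text_emb = [[1.0, 0.0], [0.0, 1.0]]
audio_emb = [[0.9, 0.1], [0.8, 0.6], [0.1, 0.9]]
ground_truth = [0, 1]  # index of the matching audio clip for each text query

for k in (1, 2):
    print(f"text-audio recall@{k}:", recall_at_k(text_emb, audio_emb, ground_truth, k))
```

Audio-text retrieval follows the same pattern with the roles swapped: an audio embedding serves as the query and the text embeddings form the candidate pool.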