We consider the task of retrieving audio using free-form natural language queries. To study this problem, which has received limited attention in the existing literature, we introduce challenging new benchmarks for text-based audio retrieval using text annotations sourced from the AudioCaps and Clotho datasets. We then employ these benchmarks to establish baselines for cross-modal audio retrieval, where we demonstrate the benefits of pre-training on diverse audio tasks. We hope that our benchmarks will inspire further research into cross-modal text-based audio retrieval with free-form text queries.
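To make the retrieval setting concrete, the following is a minimal sketch of how a text-to-audio retrieval benchmark might be scored. It assumes that text queries and audio clips have already been mapped into a shared embedding space, ranks the clips by cosine similarity to each query, and reports recall@k. The function names, the toy embeddings, and the choice of recall@k as the metric are illustrative assumptions, not details taken from the abstract above.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_audio(query_emb: Sequence[float],
               audio_embs: Dict[str, Sequence[float]]) -> List[str]:
    """Return audio ids sorted from most to least similar to the query."""
    return sorted(audio_embs,
                  key=lambda audio_id: cosine(query_emb, audio_embs[audio_id]),
                  reverse=True)


def recall_at_k(queries: List[Tuple[Sequence[float], str]],
                audio_embs: Dict[str, Sequence[float]],
                k: int) -> float:
    """Fraction of queries whose target clip appears in the top-k results."""
    hits = sum(1 for query_emb, target in queries
               if target in rank_audio(query_emb, audio_embs)[:k])
    return hits / len(queries)


# Hypothetical, hand-made embeddings; real ones would come from trained
# text and audio encoders.
AUDIO = {
    "dog_bark": [1.0, 0.0, 0.0],
    "rain": [0.0, 1.0, 0.0],
    "siren": [0.0, 0.0, 1.0],
}

# Each entry pairs a (toy) query embedding with the id of its matching clip.
QUERIES = [
    ([0.9, 0.1, 0.0], "dog_bark"),
    ([0.1, 0.8, 0.1], "rain"),
    ([0.6, 0.0, 0.5], "siren"),  # deliberately ambiguous query
]

if __name__ == "__main__":
    for k in (1, 2):
        print(f"recall@{k} = {recall_at_k(QUERIES, AUDIO, k):.3f}")
```

In this toy setup the third query is closer to the wrong clip than to its target, so it counts as a miss at k=1 but as a hit at k=2. This illustrates why retrieval results are usually reported at several values of k.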