Recent progress in text-based audio retrieval has largely been propelled by the release of suitable datasets. Since the manual creation of such datasets is laborious, obtaining data from online resources can be a cheap way to build large-scale datasets. We study the recently proposed SoundDesc benchmark dataset, which was automatically sourced from the BBC Sound Effects web page. In our analysis, we find that SoundDesc contains several duplicates that cause leakage of training data into the evaluation data. This data leakage ultimately leads to overly optimistic retrieval performance estimates in previous benchmarks. We propose new training, validation, and testing splits for the dataset that we make available online. To avoid weak contamination of the test data, we pool audio files that share similar recording setups. In our experiments, we find that the new splits serve as a more challenging benchmark.
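The pooling step can be realized as a group-aware split: files from the same recording setup are kept together so that none of them straddle the train/test boundary. Below is a minimal sketch using scikit-learn's GroupShuffleSplit; the file identifiers, the hypothetical setup_id grouping key, and the split proportions are illustrative assumptions, not the exact procedure used to build the released splits.

```python
# Minimal sketch of a group-aware train/val/test split. Assumes each audio
# file carries a hypothetical setup_id that pools recordings made with the
# same recording setup; all identifiers here are illustrative.
from sklearn.model_selection import GroupShuffleSplit

file_ids = ["bbc_00001", "bbc_00002", "bbc_00003", "bbc_00004", "bbc_00005"]
setup_ids = ["setupA", "setupA", "setupB", "setupC", "setupC"]

# First carve out the test set: files sharing a setup never cross the split
# boundary, which prevents near-duplicates from leaking into the test data.
outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_val_idx, test_idx = next(outer.split(file_ids, groups=setup_ids))

# Then split the remainder into train and validation, again by group.
remaining_files = [file_ids[i] for i in train_val_idx]
remaining_groups = [setup_ids[i] for i in train_val_idx]
inner = GroupShuffleSplit(n_splits=1, test_size=0.125, random_state=0)
train_idx, val_idx = next(inner.split(remaining_files, groups=remaining_groups))

print("test:", [file_ids[i] for i in test_idx])
print("val:", [remaining_files[i] for i in val_idx])
print("train:", [remaining_files[i] for i in train_idx])
```

Note that GroupShuffleSplit samples whole groups rather than individual files, so split sizes are approximate fractions of the number of groups rather than of the number of files.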