Audio Retrieval with Natural Language Queries

We consider the task of retrieving audio using free-form natural language queries. To study this problem, which has received limited attention in the existing literature, we introduce challenging new benchmarks for text-based audio retrieval using text annotations sourced from the AudioCaps and Clotho datasets. We then employ these benchmarks to establish baselines for cross-modal audio retrieval, where we demonstrate the benefits of pre-training on diverse audio tasks. We hope that our benchmarks will inspire further research into cross-modal text-based audio retrieval with free-form text queries.

Related content

Attention mechanism

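The attention mechanism was first proposed in the field of computer vision, but it became widely popular with the Google DeepMind paper "Recurrent Models of Visual Attention" [14], which applied attention to an RNN model for image classification. Subsequently, Bahdanau et al., in "Neural Machine Translation by Jointly Learning to Align and Translate" [1], used an attention-like mechanism to perform translation and alignment jointly in machine translation; their work is generally regarded as the first to apply attention to NLP. Similar attention-based extensions of RNN models were then applied to a wide range of NLP tasks. More recently, how to use attention in CNNs has also become an active research topic. The figure below shows the rough trend of progress in attention research.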