In text-audio retrieval (TAR), text and audio are heterogeneous in content, so the semantic information in a text typically corresponds to only certain frames of the audio. Yet existing works aggregate the entire audio without considering the text, for example by mean-pooling over the frames, which is likely to encode misleading audio information not described in the given text. In this paper, we present a text-aware attention pooling (TAP) module for TAR, which is essentially a scaled dot-product attention that lets a text attend to its most semantically similar frames. Furthermore, previous methods apply the softmax only within each single retrieval direction, ignoring potential cross-retrieval information. By exploring the intrinsic prior of each text-audio pair, we introduce a prior matrix revised (PMR) loss to filter hard cases with high (or low) text-to-audio but low (or high) audio-to-text similarity scores, thus achieving the dual optimal match. Experiments show that TAP significantly outperforms various text-agnostic pooling functions. Moreover, the PMR loss yields stable performance gains on multiple datasets.
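
To make the contrast between text-aware and text-agnostic pooling concrete, the following is a minimal sketch of scaled dot-product attention pooling over audio frames, written with NumPy. It is an illustration under stated assumptions, not the module described in this paper: the function names `text_aware_pool` and `mean_pool`, the single text vector used as the query, and the absence of learned query/key/value projections are all simplifications introduced here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def text_aware_pool(text_emb, audio_frames):
    """Pool audio frames with the text embedding as the attention query.

    Assumption (not from the paper): no learned projections are applied,
    so the frames serve as both keys and values.

    text_emb:     shape (d,)   one text embedding
    audio_frames: shape (T, d) one embedding per audio frame
    returns:      (pooled of shape (d,), attention weights of shape (T,))
    """
    d = text_emb.shape[-1]
    # Scaled dot-product score between the text and each frame.
    scores = audio_frames @ text_emb / np.sqrt(d)
    # Normalise the scores over the T frames.
    weights = softmax(scores, axis=-1)
    # Weighted sum of frames: frames similar to the text dominate.
    pooled = weights @ audio_frames
    return pooled, weights


def mean_pool(audio_frames):
    """Text-agnostic baseline: every frame contributes equally."""
    return audio_frames.mean(axis=0)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    # Toy data: only the first of three frames matches the text.
    text = np.array([4.0, 0.0, 0.0, 0.0])
    frames = np.array([
        [4.0, 0.0, 0.0, 0.0],
        [0.0, 4.0, 0.0, 0.0],
        [0.0, 0.0, 4.0, 0.0],
    ])
    pooled, weights = text_aware_pool(text, frames)
    print("attention weights:", weights)
    print("cosine(text, text-aware pooled):", cosine(text, pooled))
    print("cosine(text, mean pooled):      ", cosine(text, mean_pool(frames)))
```

In this toy case nearly all the attention weight falls on the one frame that matches the text, so the pooled audio vector stays close to the text, whereas mean-pooling dilutes it with the two unrelated frames.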