In text-audio retrieval (TAR) tasks, due to the heterogeneity between text and audio content, the semantic information contained in a text typically corresponds to only certain frames within the audio. Yet existing works aggregate the entire audio without considering the text, e.g., by mean-pooling over the frames, which is likely to encode misleading audio information not described in the given text. In this paper, we present a text-aware attention pooling (TAP) module for TAR, which is essentially a scaled dot-product attention that lets a text attend to its most semantically similar frames. Furthermore, previous methods apply the softmax to each retrieval direction separately, ignoring potential cross-retrieval information. By exploiting the intrinsic prior of each text-audio pair, we introduce a prior matrix revised (PMR) loss to filter out hard cases with high (or low) text-to-audio but low (or high) audio-to-text similarity scores, thus achieving a dual optimal match. Experiments show that TAP significantly outperforms various text-agnostic pooling functions, and the PMR loss yields stable performance gains on multiple datasets.
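To make the TAP idea concrete, the following is a minimal numpy sketch of text-aware attention pooling: the text embedding serves as the query in a scaled dot-product attention over frame-level audio embeddings, so the pooled audio vector is weighted toward the frames most similar to the text. The function name and shapes are illustrative, not the paper's exact implementation.

```python
import numpy as np

def text_aware_attention_pooling(text_emb, audio_frames):
    """Pool frame embeddings with the text as attention query (a sketch).

    text_emb:     (d,)  text embedding, used as the query
    audio_frames: (T, d) frame-level audio embeddings, used as keys and values
    returns:      (d,)  text-aware audio embedding
    """
    d = text_emb.shape[0]
    # Scaled dot-product attention scores of the text against each frame.
    scores = audio_frames @ text_emb / np.sqrt(d)          # (T,)
    # Numerically stable softmax over the T frames.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted sum of frames: frames similar to the text dominate the pool.
    return weights @ audio_frames
```

Unlike mean-pooling, which weights every frame equally regardless of the query, this pooling produces a different audio representation for each paired text.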
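The abstract does not spell out the PMR construction, but the intuition (down-weighting pairs that score high in only one retrieval direction) can be sketched with a dual-softmax-style prior over the batch similarity matrix. This is an assumption about one plausible realization, not the paper's exact loss.

```python
import numpy as np

def _softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pmr_revised_similarity(S):
    """Revise a similarity matrix with a cross-direction prior (a sketch).

    S: (N, N) with S[i, j] = similarity(text_i, audio_j)
    """
    p_t2a = _softmax(S, axis=1)   # text -> audio match distribution per row
    p_a2t = _softmax(S, axis=0)   # audio -> text match distribution per column
    # The element-wise product acts as a prior: a pair with high text-to-audio
    # but low audio-to-text score (or vice versa) gets suppressed, steering
    # training toward pairs that are optimal in both directions.
    return S * p_t2a * p_a2t
```

A standard contrastive (e.g. cross-entropy over rows and columns) loss can then be applied to the revised matrix instead of the raw one.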