In this paper, we address the text-to-audio grounding problem, namely, localizing in untrimmed audio the segments of the sound event described by a natural language query. This is a recently proposed and challenging audio-language task, since it requires not only precisely localizing the onsets and offsets of all the desired segments in the audio, but also performing comprehensive acoustic and linguistic understanding and reasoning about the multimodal interactions between the audio and the query. To tackle these problems, the existing method treats the query holistically as a single unit through a global query representation, which fails to highlight the keywords that carry rich semantics. In addition, this method does not fully exploit the interactions between the query and the audio. Moreover, since the audio and queries are arbitrary and variable in length, they contain many uninformative parts that this method does not filter out, which hinders the grounding of the desired segments. To address these limitations, we propose a novel Query Graph with Cross-gating Attention (QGCA) model, which models the comprehensive relations between the words in the query through a novel query graph. Furthermore, to capture the fine-grained interactions between audio and query, we introduce a cross-modal attention module that assigns higher weights to the keywords and generates snippet-specific query representations. Finally, we design a cross-gating module to emphasize the crucial parts and weaken the irrelevant ones in both the audio and the query. We extensively evaluate the proposed QGCA model on the public Audiogrounding dataset and obtain significant improvements over several state-of-the-art methods. A further ablation study shows the consistent effectiveness of the different modules in the proposed QGCA model.
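As a rough illustration of the cross-modal attention and cross-gating ideas described above, the following minimal sketch may help. It is not the QGCA implementation: the function names (`snippet_specific_queries`, `cross_gate`), the scaled dot-product scoring, and the parameter-free sigmoid gates are all assumptions made for illustration, and the learned projections and the query graph are omitted entirely.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def snippet_specific_queries(audio, words):
    """For each audio snippet, attend over the query words.

    audio: list of snippet feature vectors (each of length d)
    words: list of word feature vectors (each of length d)
    Returns (weights, queries): weights[i][j] is the attention that
    snippet i pays to word j, and queries[i] is the attention-weighted
    sum of the word features for snippet i.
    """
    d = len(words[0])
    weights, queries = [], []
    for a in audio:
        # Assumed scoring function: scaled dot product.
        w = softmax([dot(a, q) / math.sqrt(d) for q in words])
        weights.append(w)
        queries.append(
            [sum(w[j] * words[j][k] for j in range(len(words))) for k in range(d)]
        )
    return weights, queries


def cross_gate(audio, queries):
    """Gate each modality by a sigmoid of the other one.

    Hypothetical, parameter-free gate: a trained model would normally
    apply learned linear maps before the sigmoid.
    """
    gated_audio = [
        [sigmoid(q_k) * a_k for a_k, q_k in zip(a, q)]
        for a, q in zip(audio, queries)
    ]
    gated_query = [
        [sigmoid(a_k) * q_k for a_k, q_k in zip(a, q)]
        for a, q in zip(audio, queries)
    ]
    return gated_audio, gated_query


# Toy input: two audio snippets and three query words, feature size 2.
audio = [[1.0, 0.0], [0.0, 1.0]]
words = [[2.0, 0.0], [0.0, 2.0], [0.1, 0.1]]

weights, queries = snippet_specific_queries(audio, words)
gated_audio, gated_query = cross_gate(audio, queries)
```

In this toy input, each snippet places its largest attention weight on the word whose feature vector points in the same direction, which is the sense in which the resulting query representation is snippet-specific. The gates lie strictly between 0 and 1, so they can only attenuate a feature, never amplify it.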