In this paper, we address the text-to-audio grounding task, namely, localizing in an untrimmed audio recording the segments of the sound event described by a natural language query. This is a newly proposed yet challenging audio-language task, since it requires not only precisely localizing the on- and off-sets of all desired segments in the audio, but also performing comprehensive acoustic and linguistic understanding and reasoning about the multimodal interactions between the audio and the query. Existing methods typically tackle this problem by treating the query holistically as a single unit via a global query representation; we argue that this approach suffers from several limitations. Motivated by these considerations, we propose a novel Cross-modal Graph Interaction (CGI) model, which comprehensively models the relations between the words in a query through a novel language graph. To capture the fine-grained interactions between the audio and the query, a cross-modal attention module is introduced to assign higher weights to keywords with more important semantics and to generate snippet-specific query representations. Furthermore, we design a cross-gating module to emphasize the crucial parts of the audio and query and to weaken the irrelevant ones. We extensively evaluate the proposed CGI model on the public AudioGrounding dataset, achieving significant improvements over several state-of-the-art methods. The ablation study demonstrates the consistent effectiveness of the different modules in our model.
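As a rough illustration of the cross-modal attention and cross-gating ideas summarized above, the following PyTorch sketch shows one common way such modules can be realized. It is not the authors' implementation: the module names, feature dimensions, scaled dot-product attention form, and sigmoid gating are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Sketch (assumed form): each audio snippet attends over the query words,
    so keywords with more relevant semantics receive higher weights and every
    snippet obtains its own snippet-specific query representation."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj_a = nn.Linear(dim, dim)  # project audio snippets
        self.proj_w = nn.Linear(dim, dim)  # project query words

    def forward(self, audio: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # audio: (B, T, D) snippet features; words: (B, L, D) word features.
        scores = self.proj_a(audio) @ self.proj_w(words).transpose(1, 2)  # (B, T, L)
        attn = F.softmax(scores / audio.size(-1) ** 0.5, dim=-1)          # word weights per snippet
        return attn @ words                                               # (B, T, D) snippet-specific query

class CrossGating(nn.Module):
    """Sketch (assumed form): each modality produces a sigmoid gate that
    re-weights the other, emphasizing crucial parts and suppressing
    irrelevant ones."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_a = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.gate_q = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, audio: torch.Tensor, query: torch.Tensor):
        # Query gates the audio features and vice versa.
        return audio * self.gate_q(query), query * self.gate_a(audio)

# Toy usage: 2 clips, 100 audio snippets, 12 query words, 256-d features.
audio, words = torch.randn(2, 100, 256), torch.randn(2, 12, 256)
query = CrossModalAttention(256)(audio, words)                # (2, 100, 256)
gated_audio, gated_query = CrossGating(256)(audio, query)
print(gated_audio.shape, gated_query.shape)
```

The per-snippet gated features would then feed a grounding head that predicts the on- and off-sets of the queried sound event.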