The amount of audio data available on public websites is growing rapidly, so an efficient mechanism for accessing the desired data is necessary. We propose a content-based audio retrieval method that retrieves target audio that is similar to, but slightly different from, the query audio by introducing auxiliary text that describes the difference between the query and the target. Whereas conventional content-based audio retrieval is limited to audio that is similar to the query, the proposed method can adjust the retrieval range by adding an embedding of the auxiliary text (the query modifier) to the embedding of the query audio sample in a shared latent space. To evaluate the method, we built a dataset in which each item consists of two different audio clips and text that describes the difference between them. The experimental results show that the proposed method retrieves the paired audio more accurately than the baseline. Visualization also confirms that the proposed method learns a shared latent space in which an audio difference and its corresponding text are represented by similar embedding vectors.
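The retrieval step described above (adding the text query-modifier embedding to the query audio embedding, then searching the shared latent space) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the encoders or the similarity measure, so the 2-D vectors, the candidate names, the helper names `add`, `cosine`, and `retrieve`, and the use of cosine similarity are all assumptions made for the example.

```python
import math


def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity (an assumed choice; the abstract names no metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_audio_emb, modifier_text_emb, candidates):
    """Rank candidate names by similarity to (audio embedding + text embedding)."""
    shifted_query = add(query_audio_emb, modifier_text_emb)
    return sorted(
        candidates,
        key=lambda name: cosine(shifted_query, candidates[name]),
        reverse=True,
    )


# Toy embeddings, invented for illustration only (not taken from the paper).
query_audio = [1.0, 0.0]    # hypothetical embedding of a "rain" clip
modifier_text = [0.0, 1.0]  # hypothetical embedding of "add thunder"
candidates = {
    "rain": [1.0, 0.0],
    "rain_with_thunder": [1.0, 1.0],
    "birdsong": [-1.0, 0.2],
}

ranking = retrieve(query_audio, modifier_text, candidates)
print(ranking)
```

In this toy setting, the shifted query moves away from the plain "rain" clip and toward the clip that differs in the way the text describes, which is the behavior the abstract attributes to the shared latent space. In a real system the embeddings would come from trained audio and text encoders rather than hand-written vectors.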