We present a method for visually-grounded spoken term discovery. After training either a HuBERT or wav2vec2.0 model to associate spoken captions with natural images, we show that powerful word segmentation and clustering capability emerges within the model's self-attention heads. Our experiments reveal that this ability is not present to nearly the same extent in the base HuBERT and wav2vec2.0 models, suggesting that the visual grounding task is a crucial component of the word discovery capability we observe. We also evaluate our method on the Buckeye word segmentation and ZeroSpeech spoken term discovery tasks, where we outperform all currently published methods on several metrics.