We present a method for visually-grounded spoken term discovery. After training either a HuBERT or wav2vec2.0 model to associate spoken captions with natural images, we show that powerful word segmentation and clustering capability emerges within the model's self-attention heads. Our experiments reveal that this ability is not present to nearly the same extent in the base HuBERT and wav2vec2.0 models, suggesting that the visual grounding task is a crucial component of the word discovery capability we observe. We also evaluate our method on the Buckeye word segmentation and ZeroSpeech spoken term discovery tasks, where we outperform all currently published methods on several metrics. Code and model weights are available at https://github.com/jasonppy/word-discovery.
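The released repository above is the authoritative implementation. Purely as an illustrative sketch of the general idea of inspecting self-attention in a speech model, the snippet below loads a base wav2vec2.0 checkpoint via Hugging Face Transformers, extracts its attention maps, and applies a crude thresholding heuristic to propose segment boundaries. The checkpoint name, the choice of layer, and the threshold are assumptions for demonstration and do not reproduce the paper's visually grounded training or its actual segmentation procedure.

```python
# Illustrative sketch (not the paper's method): inspect self-attention in a
# base wav2vec2.0 model and propose coarse segment boundaries from how much
# attention each frame receives. Checkpoint and threshold are assumptions.
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

waveform = torch.randn(1, 16000)  # 1 second of placeholder 16 kHz audio

with torch.no_grad():
    out = model(waveform, output_attentions=True)

# out.attentions: one (batch, heads, frames, frames) tensor per layer
attn = out.attentions[-1][0]          # last layer, first item in the batch
received = attn.sum(dim=(0, 1))       # total attention each frame receives
received = received / received.max()  # normalize to [0, 1]

# Treat frames that receive little attention as candidate boundaries
# (a crude heuristic, purely for illustration).
boundaries = (received < 0.2).nonzero(as_tuple=True)[0].tolist()
print("candidate boundary frames:", boundaries)
```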