Visually grounded speech (VGS) models are trained on images paired with unlabelled spoken captions. Such models could be used to build speech systems in settings where it is impossible to get labelled data, e.g. for documenting unwritten languages. However, most VGS studies are in English or other high-resource languages. This paper attempts to address this shortcoming. We collect and release a new single-speaker dataset of audio captions for 6k Flickr images in Yor\`ub\'a -- a real low-resource language spoken in Nigeria. We train an attention-based VGS model where images are automatically tagged with English visual labels and paired with Yor\`ub\'a utterances. This enables cross-lingual keyword localisation: a written English query is detected and located in Yor\`ub\'a speech. To quantify the effect of the smaller dataset, we compare to English systems trained on similar and more data. We hope that this new dataset will stimulate research in the use of VGS models for real low-resource languages.
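
The abstract describes an attention-based model in which English visual tags act as written queries against Yor\`ub\'a speech. The following is a minimal sketch of that general idea, not the authors' implementation: module names, dimensions, and the tag vocabulary are illustrative assumptions.

```python
# Hedged sketch of attention-based cross-lingual keyword localisation:
# per-frame attention weights over a speech encoding suggest *where* an
# English query keyword occurs in a Yoruba utterance, and a pooled score
# says *whether* it occurs. All sizes and names are assumptions.

import torch
import torch.nn as nn


class AttentionKeywordLocaliser(nn.Module):
    def __init__(self, n_mels=40, hidden=128, vocab_size=67):
        super().__init__()
        # Acoustic encoder: frame-level features from a mel-spectrogram.
        self.encoder = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        # One embedding per written English keyword (the visual tag vocabulary).
        self.keyword_emb = nn.Embedding(vocab_size, 2 * hidden)
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, mels, keyword_ids):
        # mels: (batch, frames, n_mels); keyword_ids: (batch,)
        frames, _ = self.encoder(mels)                      # (B, T, 2H)
        query = self.keyword_emb(keyword_ids).unsqueeze(1)  # (B, 1, 2H)
        # Attention weights give a rough localisation of the keyword ...
        attn = torch.softmax((frames * query).sum(-1), dim=-1)      # (B, T)
        context = (attn.unsqueeze(-1) * frames).sum(dim=1)          # (B, 2H)
        # ... and the attention-pooled score gives the detection decision.
        detection = torch.sigmoid(self.score(context)).squeeze(-1)  # (B,)
        return detection, attn


# Usage: detect and locate a hypothetical English query (e.g. "dog")
# in one Yoruba utterance of roughly 5 seconds of 10 ms mel frames.
model = AttentionKeywordLocaliser()
mels = torch.randn(1, 500, 40)
query = torch.tensor([3])  # hypothetical index of the query keyword
prob, attn = model(mels, query)
print(prob.item(), attn.argmax(dim=-1).item())  # detection score, peak frame
```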