This study investigates the use of Visually Grounded Speech (VGS) models for keyword localisation in speech. It focuses on two main research questions: (1) is keyword localisation possible with VGS models, and (2) can keyword localisation be performed cross-lingually in a real low-resource setting? Four localisation methods are proposed and evaluated on an English dataset, with the best-performing method achieving an accuracy of 57%. A new dataset containing spoken captions in Yoruba is also collected and released for cross-lingual keyword localisation. The cross-lingual model obtains a precision of 16% in actual keyword localisation, and this performance can be improved by initialising from a model pretrained on English data. The study presents a detailed analysis of the model's success and failure modes and highlights the challenges of using VGS models for keyword localisation in low-resource settings.
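As an informal illustration of the kind of metric reported above (not necessarily the paper's exact evaluation protocol), the sketch below shows one common way to score keyword localisation: a prediction counts as correct when the model's highest-scoring frame falls inside the keyword's ground-truth time span. All function and variable names here are hypothetical.

```python
import numpy as np

def localisation_accuracy(frame_scores, true_spans):
    """Fraction of utterances where the highest-scoring frame for a
    keyword falls inside that keyword's true time span.

    frame_scores: list of 1-D arrays, per-frame scores for one keyword
                  in each utterance (hypothetical model output).
    true_spans:   list of (start_frame, end_frame) tuples giving the
                  ground-truth location of the keyword in each utterance.
    """
    hits = 0
    for scores, (start, end) in zip(frame_scores, true_spans):
        peak = int(np.argmax(scores))        # proposed keyword location
        hits += int(start <= peak <= end)    # correct if inside the true span
    return hits / len(frame_scores)

# Toy example: two utterances, with the keyword at frames 10-20 and 5-8.
scores_1 = np.zeros(50); scores_1[15] = 1.0  # peak inside the span -> hit
scores_2 = np.zeros(30); scores_2[25] = 1.0  # peak outside the span -> miss
print(localisation_accuracy([scores_1, scores_2], [(10, 20), (5, 8)]))  # 0.5
```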