Acoustic word embedding models map variable-duration speech segments to fixed-dimensional vectors, enabling efficient speech search and discovery. Previous work explored how embeddings can be obtained in zero-resource settings where no labelled data is available in the target language. The current best approach uses transfer learning: a single supervised multilingual model is trained using labelled data from multiple well-resourced languages and is then applied to a target zero-resource language (without fine-tuning). However, it is still unclear how the specific choice of training languages affects downstream performance. Concretely, we ask here whether it is beneficial to use training languages related to the target. Using data from eleven languages spoken in Southern Africa, we experiment with adding data from different language families while controlling for the amount of data per language. In word discrimination and query-by-example search evaluations, we show that training on languages from the same family as the target gives large improvements. Through finer-grained analysis, we show that training on even just a single related language gives the largest gain. We also find that adding data from unrelated languages generally does not hurt performance.
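To make the core idea concrete, the sketch below illustrates (in PyTorch) how an acoustic word embedding encoder can map variable-duration feature sequences to fixed-dimensional vectors, and how query-by-example search then reduces to nearest-neighbour comparison of those vectors. This is a minimal illustrative example only, assuming a simple GRU encoder with cosine scoring; it is not the specific architecture or training objective used in the paper.

```python
# Minimal sketch (assumed architecture, not the paper's model): a GRU reads a
# variable-duration sequence of speech features (e.g. MFCC frames) and its
# final hidden state serves as a fixed-dimensional acoustic word embedding.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class AcousticWordEncoder(nn.Module):
    def __init__(self, feat_dim=13, embed_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, embed_dim, batch_first=True)

    def forward(self, segments, lengths):
        # segments: (batch, max_frames, feat_dim); lengths: true frame counts
        packed = pack_padded_sequence(
            segments, lengths, batch_first=True, enforce_sorted=False
        )
        _, hidden = self.rnn(packed)   # hidden: (1, batch, embed_dim)
        return hidden.squeeze(0)       # fixed-dimensional embeddings


# Toy usage: two segments of different duration map to same-sized vectors.
encoder = AcousticWordEncoder()
batch = torch.zeros(2, 50, 13)         # padded to the longest segment
batch[0, :50] = torch.randn(50, 13)
batch[1, :30] = torch.randn(30, 13)
embeddings = encoder(batch, torch.tensor([50, 30]))
print(embeddings.shape)                # torch.Size([2, 128])

# Query-by-example search: rank candidate segments by cosine similarity
# between their embeddings and the query embedding.
query = embeddings[0]
scores = nn.functional.cosine_similarity(
    query.unsqueeze(0), embeddings, dim=1
)
print(scores.argsort(descending=True))
```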