Multilingual transfer of acoustic word embeddings improves when training on languages related to the target zero-resource language

Acoustic word embedding models map variable-duration speech segments to fixed-dimensional vectors, enabling efficient speech search and discovery. Previous work explored how embeddings can be obtained in zero-resource settings, where no labelled data is available in the target language. The current best approach uses transfer learning: a single supervised multilingual model is trained on labelled data from multiple well-resourced languages and then applied to a target zero-resource language without fine-tuning. However, it is still unclear how the specific choice of training languages affects downstream performance. Concretely, we ask whether it is beneficial to use training languages related to the target. Using data from eleven languages spoken in Southern Africa, we experiment with adding data from different language families while controlling for the amount of data per language. In word discrimination and query-by-example search evaluations, we show that training on languages from the same family gives large improvements. Through finer-grained analysis, we show that training on even just a single related language gives the largest gain. We also find that adding data from unrelated languages generally does not hurt performance.
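To make the idea of mapping variable-duration segments to fixed-dimensional vectors concrete, here is a minimal sketch in plain Python. It is not the supervised multilingual model described above: it uses simple uniform downsampling of frames as a stand-in embedding function, and the names `downsample_embedding` and `cosine_distance`, the toy two-dimensional frames, and the choice of four sampled frames are all assumptions made for illustration. The comparison at the end mimics the spirit of a word discrimination check, where segments of the same word type should be closer than segments of different types.

```python
import math
from typing import List, Sequence

Frame = Sequence[float]


def downsample_embedding(frames: Sequence[Frame], n_samples: int = 4) -> List[float]:
    """Map a variable-length frame sequence to a fixed-length vector.

    Illustrative stand-in (assumption): pick n_samples uniformly spaced
    frames and concatenate them. The output length is always
    n_samples * frame_dim, regardless of the segment duration.
    """
    if not frames:
        raise ValueError("segment must contain at least one frame")
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    last = len(frames) - 1
    if n_samples == 1:
        indices = [last // 2]
    else:
        indices = [round(i * last / (n_samples - 1)) for i in range(n_samples)]
    return [value for idx in indices for value in frames[idx]]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Toy segments (hypothetical data): two "same word" segments with
# different durations, and one segment of a "different word".
same_word_short = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3   # 6 frames
same_word_long = [[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 5    # 10 frames
different_word = [[0.0, 1.0]] * 4 + [[1.0, 0.0]] * 4    # 8 frames

emb_short = downsample_embedding(same_word_short)
emb_long = downsample_embedding(same_word_long)
emb_other = downsample_embedding(different_word)

same_distance = cosine_distance(emb_short, emb_long)
different_distance = cosine_distance(emb_short, emb_other)
```

All three embeddings have the same length even though the segments have 6, 10, and 8 frames, so they can be compared directly with a vector distance. In this toy setup the same-word pair ends up closer than the different-word pair, which is the property that word discrimination and query-by-example search evaluations rely on.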


