Recent complementary strands of research have shown that encoding properties of the data source into embeddings can lead to performance increases when training a single model on heterogeneous data sources. However, it remains unclear in which situations these dataset embeddings are most effective, because they have been used in a wide variety of settings, languages, and tasks. Furthermore, it is usually assumed that gold information on the data source is available, and that the test data comes from a distribution seen during training. In this work, we compare the effect of dataset embeddings in monolingual settings, multilingual settings, and with predicted data source labels in a zero-shot setting. We evaluate on three morphosyntactic tasks (morphological tagging, lemmatization, and dependency parsing) using 104 datasets, 66 languages, and two different dataset grouping strategies. Performance increases are highest when the datasets share a language and the distribution from which each test instance is drawn is known. In contrast, in setups where the test data comes from an unseen distribution, the performance increase vanishes.
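To make the mechanism concrete, the following minimal sketch (our illustration, not the authors' implementation; all names, dimensions, and the BiLSTM architecture are assumptions) shows one common way to use a dataset embedding: a learned vector per data source, concatenated to every token representation before the encoder of a tagger trained on the concatenation of heterogeneous datasets.

```python
import torch
import torch.nn as nn

class DatasetEmbeddingTagger(nn.Module):
    """Illustrative sketch of a sequence tagger whose input is
    augmented with a learned embedding identifying the source
    dataset (e.g., a treebank). Not the paper's exact model."""

    def __init__(self, vocab_size, num_datasets, num_tags,
                 word_dim=100, dataset_dim=12, hidden_dim=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        # One learned vector per data source; at test time the index
        # is either gold (known source) or predicted (zero-shot).
        self.dataset_emb = nn.Embedding(num_datasets, dataset_dim)
        self.encoder = nn.LSTM(word_dim + dataset_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, word_ids, dataset_id):
        # word_ids: (batch, seq_len); dataset_id: (batch,)
        words = self.word_emb(word_ids)              # (batch, seq, word_dim)
        ds = self.dataset_emb(dataset_id)            # (batch, dataset_dim)
        # Broadcast the dataset vector to every token position.
        ds = ds.unsqueeze(1).expand(-1, words.size(1), -1)
        hidden, _ = self.encoder(torch.cat([words, ds], dim=-1))
        return self.out(hidden)                      # per-token tag scores

# Hypothetical usage: 104 data sources, as in the experiments above.
model = DatasetEmbeddingTagger(vocab_size=10000, num_datasets=104,
                               num_tags=20)
scores = model(torch.randint(0, 10000, (2, 8)), torch.tensor([3, 41]))
```

In the settings studied here, the `dataset_id` passed at test time is the crux: with gold labels the model conditions on the true source, while in the zero-shot setting an unseen-distribution instance must first be assigned a (predicted) source label, which is where the performance increase is found to vanish.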