With the increasing availability of datasets, the potential for learning from a variety of data sources has grown. One particular method to improve learning from multiple data sources is to embed the data source during training. This allows the model to learn generalizable features as well as features that distinguish between datasets. However, these dataset embeddings have mostly been used before contextualized transformer-based embeddings were introduced in the field of Natural Language Processing. In this work, we compare two methods of embedding the dataset in a transformer-based multilingual dependency parser and perform an extensive evaluation. We show that: 1) embedding the dataset is still beneficial with these models; 2) performance increases are highest when embedding the dataset at the encoder level; 3) unsurprisingly, performance increases are highest for small datasets and datasets with a low baseline score; and 4) training on the combination of all datasets performs similarly to designing smaller clusters based on language relatedness.
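To make the idea of an encoder-level dataset embedding concrete, the following is a minimal sketch, not the implementation evaluated in this work: each dataset identifier is mapped to a learned vector that is summed with the token embeddings before they enter the encoder. All class names, dimensions, and the use of a small PyTorch Transformer encoder are illustrative assumptions.

```python
# Hypothetical sketch of a dataset embedding injected at the encoder level.
# The dataset id of each sentence selects a learned vector that is added to
# every token embedding before encoding.
import torch
import torch.nn as nn


class DatasetAwareEncoder(nn.Module):
    def __init__(self, vocab_size: int, num_datasets: int, dim: int = 128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        # One learned vector per training dataset (e.g., per treebank).
        self.dataset_emb = nn.Embedding(num_datasets, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids: torch.Tensor, dataset_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); dataset_ids: (batch,)
        tokens = self.token_emb(token_ids)
        datasets = self.dataset_emb(dataset_ids).unsqueeze(1)  # broadcast over seq_len
        return self.encoder(tokens + datasets)


# Usage: a batch of two sentences drawn from two different datasets.
model = DatasetAwareEncoder(vocab_size=1000, num_datasets=5)
out = model(torch.randint(0, 1000, (2, 7)), torch.tensor([0, 3]))
print(out.shape)  # torch.Size([2, 7, 128])
```

Summing the dataset vector with the token embeddings is only one possible injection point; the alternative compared in this work embeds the dataset at a different level of the parser.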