Input distribution shift is a central problem in unsupervised domain adaptation (UDA). The most popular UDA approaches focus on domain-invariant representation learning (DIRL), aligning features from different domains into similar feature distributions. However, these approaches ignore the direct alignment of input word distributions across domains, which is a key factor in word-level classification tasks such as cross-domain named entity recognition (NER). In this work, we shed new light on cross-domain NER by introducing X-Piece, a subword-level solution to input word-level distribution shift in NER. Specifically, we re-tokenize the input words of the source domain so that their subword distribution approaches that of the target domain, which we formulate and solve as an optimal transport problem. Because the approach operates at the input level, it can also be combined with previous DIRL methods for further improvement. Experimental results show the effectiveness of the proposed method with a BERT-tagger on four benchmark NER datasets. The proposed method is also shown to benefit DIRL methods such as DANN.
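To make the optimal transport formulation concrete, below is a minimal sketch of entropy-regularized optimal transport (Sinkhorn iterations) between a source and a target subword frequency distribution. This is a generic illustration, not X-Piece's actual algorithm: the function name `sinkhorn`, the toy distributions, and the unit cost matrix are all assumptions made for the example.

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.1, n_iter=200):
    """Entropy-regularized OT: returns a transport plan P with
    row marginals a (source) and column marginals b (target),
    minimizing <P, C> + reg * entropy penalty."""
    K = np.exp(-C / reg)                 # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                # rescale columns to match b
        u = a / (K @ v)                  # rescale rows to match a
    return np.diag(u) @ K @ np.diag(v)   # plan P: P @ 1 = a, P.T @ 1 = b

# Toy example (illustrative values, not from the paper):
# rows index source subwords, columns index target subwords.
src = np.array([0.5, 0.3, 0.2])          # source subword frequencies
tgt = np.array([0.4, 0.4, 0.2])          # target subword frequencies
cost = np.array([[0.0, 1.0, 1.0],
                 [1.0, 0.0, 1.0],
                 [1.0, 1.0, 0.0]])       # hypothetical re-tokenization cost
plan = sinkhorn(src, tgt, cost)
print(plan.round(3))  # how much source subword mass maps to each target subword
```

In a setup like X-Piece's, the resulting plan could guide how often each source word is re-tokenized into alternative subword sequences so that the induced source subword distribution moves toward the target's; the cost matrix and candidate tokenizations would come from the tokenizer's vocabulary rather than the toy values above.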