Code-switching (CS), a phenomenon ubiquitous in multilingual communities owing to the ease of communication it offers, remains an understudied problem in language processing. The primary reasons behind this are: (1) minimal efforts in leveraging large pretrained multilingual models, and (2) the lack of annotated data. A distinguishing cause of the low performance of multilingual models on CS text is the intra-sentence mixing of languages, which gives rise to switch points. We first benchmark two sequence labeling tasks, POS and NER, on 4 different language pairs with a suite of pretrained models to identify the problems and select the best performing model among them, char-BERT (addressing (1)). We then propose a self-training method to repurpose the existing pretrained models using a switch-point bias by leveraging unannotated data (addressing (2)). We finally demonstrate that our approach performs well on both tasks, reducing the performance gap at switch points while retaining the overall performance on two distinct language pairs in both tasks. Our code is available here: https://github.com/PC09/EMNLP2021-Switch-Point-biased-Self-Training.
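To make the idea of a switch-point bias concrete, the sketch below illustrates one plausible way to up-weight switch-point tokens when retraining on pseudo-labels produced during self-training. This is a minimal illustration, not the authors' implementation: the helper names (`switch_point_mask`, `biased_token_loss`), the per-token language IDs, and the weight value are all assumptions; the paper's actual procedure may differ.

```python
# Illustrative sketch of switch-point biased self-training (not the authors' code):
# 1) pseudo-label unannotated code-switched text with the current tagger,
# 2) up-weight tokens at switch points when retraining on those pseudo-labels.
# Assumes per-token language IDs are available for each sentence.

import torch
import torch.nn.functional as F


def switch_point_mask(lang_ids: torch.Tensor) -> torch.Tensor:
    """Mark positions where a token's language differs from the previous token's."""
    mask = torch.zeros_like(lang_ids, dtype=torch.bool)
    mask[:, 1:] = lang_ids[:, 1:] != lang_ids[:, :-1]
    return mask


def biased_token_loss(logits, pseudo_labels, lang_ids, switch_weight=2.0):
    """Token-level cross-entropy on pseudo-labels with extra weight at switch points."""
    # logits: (batch, seq_len, num_labels); pseudo_labels, lang_ids: (batch, seq_len)
    per_token = F.cross_entropy(
        logits.transpose(1, 2), pseudo_labels, reduction="none"
    )  # (batch, seq_len)
    weights = torch.ones_like(per_token)
    weights[switch_point_mask(lang_ids)] = switch_weight
    return (weights * per_token).mean()


# One self-training round (model, batch, and optimizer are placeholders):
# with torch.no_grad():
#     pseudo_labels = model(unlabeled_batch).logits.argmax(-1)
# loss = biased_token_loss(model(unlabeled_batch).logits, pseudo_labels, lang_ids)
# loss.backward(); optimizer.step()
```

The key design choice in such a scheme is that ordinary tokens contribute unit weight while tokens immediately following a language switch contribute more, so the retrained tagger is pushed to close the switch-point gap without sacrificing overall accuracy.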