The NLP community has recently seen rapid progress on a variety of tasks in both monolingual and multilingual language processing. These successes, together with the proliferation of mixed-language interactions on social media, have boosted interest in modeling code-mixed text. In this work, we present CodemixedNLP, an open-source library that aims to bring together advances in code-mixed NLP and open them up to the wider machine learning community. The library consists of tools to develop and benchmark versatile model architectures tailored for mixed texts, methods to expand training sets, techniques to quantify mixing styles, and fine-tuned state-of-the-art models for 7 tasks in Hinglish. We believe this work has the potential to foster a distributed yet collaborative and sustainable ecosystem in the otherwise dispersed space of code-mixing research. The toolkit is designed to be simple, easily extensible, and useful to both researchers and practitioners.