We introduce Trankit, a light-weight Transformer-based Toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks over 100 languages, and 90 pretrained pipelines for 56 languages. Built on a state-of-the-art pretrained language model, Trankit significantly outperforms prior multilingual NLP pipelines on sentence segmentation, part-of-speech tagging, morphological feature tagging, and dependency parsing, while maintaining competitive performance for tokenization, multi-word token expansion, and lemmatization over 90 Universal Dependencies treebanks. Despite the use of a large pretrained transformer, our toolkit remains efficient in memory usage and speed. This is achieved by our novel plug-and-play mechanism with Adapters, where a multilingual pretrained transformer is shared across pipelines for different languages. Our toolkit, along with pretrained models and code, is publicly available at: https://github.com/nlp-uoregon/trankit. A demo website for our toolkit is also available at: http://nlp.uoregon.edu/trankit. Finally, we provide a demo video for Trankit at: https://youtu.be/q0KGP3zGjGc.
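To make the plug-and-play design concrete, the following is a minimal usage sketch based on the Pipeline API documented in the Trankit repository; the specific method names (add, set_active) and the structure of the returned dictionary are assumptions drawn from that documentation rather than from this abstract.

```python
# Minimal sketch of Trankit usage, assuming the Pipeline API described in
# the project README (https://github.com/nlp-uoregon/trankit).
from trankit import Pipeline

# Initialize a pretrained English pipeline; the shared multilingual
# transformer is loaded once, with English-specific Adapters plugged in.
p = Pipeline('english')

# Add another language: only its lightweight Adapter and task-specific
# weights are loaded, while the large transformer stays shared.
p.add('arabic')

# Run the full pipeline (sentence segmentation, tokenization, POS and
# morphological tagging, lemmatization, dependency parsing) on raw text.
p.set_active('english')
english_doc = p('Hello! This is Trankit.')

# Switch the active language without reloading the shared transformer.
p.set_active('arabic')
arabic_doc = p('مرحبا بكم')

# The output is a nested dictionary of sentences, tokens, and annotations.
print(english_doc['sentences'][0]['tokens'][0])
```

In this sketch, switching languages via set_active only swaps the small Adapter weights, which is what keeps memory usage low when serving many languages from a single pipeline object.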