We present iNLTK, an open-source NLP library consisting of pre-trained language models and out-of-the-box support for Paraphrase Generation, Textual Similarity, Sentence Embeddings, Word Embeddings, Tokenization and Text Generation in 13 Indic Languages. By using pre-trained models from iNLTK for text classification on publicly available datasets, we significantly outperform previously reported results. On these datasets, we also show that by using pre-trained models and paraphrases from iNLTK, we can achieve more than 95% of the previous best performance by using less than 10% of the training data. iNLTK is already being widely used by the community and has 40,000+ downloads, 600+ stars and 100+ forks on GitHub. The library is available at https://github.com/goru001/inltk.