Urdu is a widely spoken language in South Asia. Although a considerable body of literature exists in Urdu, the available data is still insufficient for processing the language with NLP techniques. Highly efficient language models exist for English, a high-resource language, but Urdu and other under-resourced languages have long been neglected. To build efficient language models for these languages, we must first have good word embedding models. For Urdu, the only word embeddings available so far were trained and developed using the skip-gram model. In this paper, we build a corpus for Urdu by scraping and integrating data from various sources and compile a vocabulary for the Urdu language. We also adapt fastText embeddings and N-gram models so that they can be trained on our corpus. We use these trained embeddings for a word similarity task and compare the results with existing techniques.
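To illustrate the subword mechanism behind fastText embeddings referenced above, the following is a minimal sketch of how fastText composes a word vector from hashed character n-grams. The Urdu example words, the dimension, and the randomly seeded subword vectors are illustrative placeholders (real fastText learns these vectors during training); the sketch only shows why morphologically related Urdu words end up with overlapping representations.

```python
import hashlib
import math
import random

def char_ngrams(word, n_min=3, n_max=6):
    # fastText wraps each word in boundary markers before extracting n-grams
    w = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(w[i:i + n] for i in range(len(w) - n + 1))
    return grams

def bucket(gram, num_buckets=2_000_000):
    # each n-gram is hashed into a fixed table of subword vectors
    return int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % num_buckets

def word_vector(word, dim=50):
    # a word vector is the sum of its subword vectors; real fastText learns
    # these vectors, here each bucket gets a deterministic random stand-in
    vec = [0.0] * dim
    for g in char_ngrams(word):
        rng = random.Random(bucket(g))
        for i in range(dim):
            vec[i] += rng.uniform(-1.0, 1.0)
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# "پاکستان" and "پاکستانی" share many character n-grams, so their composed
# vectors overlap far more than those of an unrelated word such as "کتاب"
sim_related = cosine(word_vector("پاکستانی"), word_vector("پاکستان"))
sim_unrelated = cosine(word_vector("پاکستانی"), word_vector("کتاب"))
print(sim_related > sim_unrelated)
```

Because shared n-grams contribute identical subword vectors to both sums, related word forms score higher on cosine similarity even before training, which is what makes subword models attractive for a morphologically rich, under-resourced language like Urdu.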