Representing words and phrases as dense vectors of real numbers that encode semantic and syntactic properties is a vital component of natural language processing (NLP). The success of neural network (NN) models in NLP largely relies on such dense word representations learned from large unlabeled corpora. Sindhi, a morphologically rich language spoken by a large population in Pakistan and India, lacks corpora, which play an essential role as a test-bed for generating word embeddings and developing language-independent NLP systems. In this paper, a large corpus of more than 61 million words is developed for the low-resourced Sindhi language to train neural word embeddings. The corpus is acquired from multiple web resources using web-scrappy. Due to the unavailability of open-source preprocessing tools for Sindhi, preprocessing such a large corpus is a challenging problem, especially the cleaning of noisy data extracted from web resources. Therefore, a preprocessing pipeline is employed for the filtration of noisy text. Afterwards, the cleaned vocabulary is used to train Sindhi word embeddings with the state-of-the-art GloVe, Skip-Gram (SG), and Continuous Bag of Words (CBoW) word2vec algorithms. The intrinsic evaluation approaches of cosine similarity matrix and WordSim-353 are employed to evaluate the generated Sindhi word embeddings. Moreover, we compare the proposed word embeddings with the recently released Sindhi fastText (SdfastText) word representations. Our intrinsic evaluation results demonstrate the high quality of the Sindhi word embeddings generated with SG, CBoW, and GloVe compared to the SdfastText word representations.
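The intrinsic evaluation mentioned above rests on cosine similarity between word vectors: semantically related words should have vectors pointing in similar directions. A minimal sketch of this measure is shown below; the toy vectors and words are hypothetical stand-ins, not the paper's actual Sindhi embeddings, which would come from the trained SG, CBoW, or GloVe models.

```python
import math

# Hypothetical toy vectors standing in for trained word embeddings;
# in the paper, the vectors would come from the SG/CBoW/GloVe models
# trained on the Sindhi corpus.
embeddings = {
    "king":  [0.8, 0.3, 0.1],
    "queen": [0.7, 0.4, 0.2],
    "apple": [0.1, 0.9, 0.8],
}

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# A related pair should score higher than an unrelated pair.
related = cosine_similarity(embeddings["king"], embeddings["queen"])
unrelated = cosine_similarity(embeddings["king"], embeddings["apple"])
print(f"cosine(king, queen) = {related:.3f}")
print(f"cosine(king, apple) = {unrelated:.3f}")
```

Benchmarks such as WordSim-353 apply the same idea at scale, correlating these cosine scores against human similarity judgments for a fixed list of word pairs.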