低收入语言新公司:信德人 (Word Embedding based New Corpus for Low-resourced Language: Sindhi)

Representing words and phrases into dense vectors of real numbers which encode semantic and syntactic properties is a vital constituent in natural language processing (NLP). The success of neural network (NN) models in NLP largely rely on such dense word representations learned on the large unlabeled corpus. Sindhi is one of the rich morphological language, spoken by large population in Pakistan and India lacks corpora which plays an essential role of a test-bed for generating word embeddings and developing language independent NLP systems. In this paper, a large corpus of more than 61 million words is developed for low-resourced Sindhi language for training neural word embeddings. The corpus is acquired from multiple web-resources using web-scrappy. Due to the unavailability of open source preprocessing tools for Sindhi, the prepossessing of such large corpus becomes a challenging problem specially cleaning of noisy data extracted from web resources. Therefore, a preprocessing pipeline is employed for the filtration of noisy text. Afterwards, the cleaned vocabulary is utilized for training Sindhi word embeddings with state-of-the-art GloVe, Skip-Gram (SG), and Continuous Bag of Words (CBoW) word2vec algorithms. The intrinsic evaluation approach of cosine similarity matrix and WordSim-353 are employed for the evaluation of generated Sindhi word embeddings. Moreover, we compare the proposed word embeddings with recently revealed Sindhi fastText (SdfastText) word representations. Our intrinsic evaluation results demonstrate the high quality of our generated Sindhi word embeddings using SG, CBoW, and GloVe as compare to SdfastText word representations.

翻译：将言词和词句代表成包含真实数字的密集矢量,从而将语义和语义内含属性编码,这是自然语言处理(NLP)中一个至关重要的组成部分。 NLP中神经网络(NNN)模型的成功在很大程度上依赖于在大型无标签地块上所学到的如此密集的字表。信德是巴基斯坦和印度大量人口所讲的丰富形态语言之一,缺乏一个Corpora,这在生成文字嵌入和开发独立的语言内含系统方面起着一个基本作用的测试台。在本文中,为低资源信德语言开发了6,100万个以上的字集,用于培训内含内含的内含字句。由于没有为信德提供开放源的预处理工具,因此,这种大字体的拥有成为棘手的问题,特别清理从网络资源中提取的噪音数据。因此,对杂质文本的拟议过滤。随后,使用清洁的词汇用于培训信德字词嵌入,用SGLOVS-T的直径直径、SGGS-VS的直径、SLO-G-GS-S-S-S-S-S-S-V-S-S-S-S-S-S-S-S-S-SLARD-S-S-V-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-IAR-S-S-S-S-S-S-S-SLAR-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S

相关内容

词向量表示

关注 37

分散式表示即将语言表示为稠密、低维、连续的向量。研究者最早发现学习得到词嵌入之间存在类比关系。比如apple−apples ≈ car−cars， man−woman ≈ king – queen 等。这些方法都可以直接在大规模无标注语料上进行训练。词嵌入的质量也非常依赖于上下文窗口大小的选择。通常大的上下文窗口学到的词嵌入更反映主题信息，而小的上下文窗口学到的词嵌入更反映词的功能和上下文语义信息。

【EMNLP2020】自然语言生成，Neural Language Generation

专知会员服务

39+阅读 · 2020年11月20日

2020数据工程师成长路线图

专知会员服务

41+阅读 · 2020年9月6日

【知识图谱嵌入补全综述论文】embedding models for knowledge base completion

专知会员服务

102+阅读 · 2020年4月25日

【NLP模型压缩方法综述】《A Survey of Methods for Model Compression in NLP》by Madison May

专知会员服务

43+阅读 · 2020年4月22日