In this paper we present a dataset for Kangri (ISO 639-3: xnr), a low-resource Himachali language listed as endangered by the United Nations Educational, Scientific and Cultural Organization (UNESCO). Compiling the Kangri corpus was challenging because no digitized resources were available. The corpus contains 1,81,552 monolingual and 27,362 Hindi-Kangri parallel entries. We share pre-trained Kangri word embeddings. We also report Bilingual Evaluation Understudy (BLEU) and Metric for Evaluation of Translation with Explicit ORdering (METEOR) scores for Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) systems trained on the corpus. The corpus is freely available for research and non-commercial use. To the best of our knowledge, this is the first corpus for a low-resource, endangered Himachali language. The resources are available at https://github.com/chauhanshweta/Kangri_corpus.