This paper presents the phonological, morphological, and syntactic distinctions between formal and informal Persian, showing that the two variants differ fundamentally and that these differences cannot be attributed solely to pronunciation. Because informal Persian has its own distinctive characteristics, a computational model trained on formal Persian is unlikely to transfer well to informal Persian, which makes dedicated treebanks for this variety necessary. We therefore detail the development of the open-source Informal Persian Universal Dependency Treebank, a new treebank annotated within the Universal Dependencies scheme. We then investigate the parsing of informal Persian by training two dependency parsers on existing formal treebanks and evaluating them on out-of-domain data, i.e., the development set of our informal treebank. Our results show that parser performance drops substantially across the two domains: the parsers encounter more unknown tokens and structures and fail to generalize well. Furthermore, the dependency relations whose performance deteriorates the most are those that represent the unique properties of the informal variant. More broadly, this study aims to serve as a stepping stone toward recognizing the importance of informal language variants, which natural language processing tools have widely overlooked across languages.
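The cross-domain comparison described above can be illustrated with a minimal sketch, assuming both treebanks are available as CoNLL-U text with identical tokenization in the gold and predicted files. The function names (`read_conllu`, `oov_rate`, `attachment_scores`) are hypothetical helpers introduced here for illustration, not part of the paper's tooling; the official Universal Dependencies evaluation scripts handle cases this sketch ignores, such as tokenization mismatches.

```python
from typing import List, Tuple

# One token: (word form, head index, dependency relation).
Token = Tuple[str, int, str]


def read_conllu(text: str) -> List[List[Token]]:
    """Parse CoNLL-U text into sentences of (form, head, deprel) tuples."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if not cols[0].isdigit():
            continue
        # CoNLL-U columns: FORM is index 1, HEAD is index 6, DEPREL is index 7.
        current.append((cols[1], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def oov_rate(train: List[List[Token]], test: List[List[Token]]) -> float:
    """Fraction of test tokens whose form never occurs in the training data."""
    vocab = {form for sent in train for form, _, _ in sent}
    forms = [form for sent in test for form, _, _ in sent]
    return sum(form not in vocab for form in forms) / len(forms)


def attachment_scores(
    gold: List[List[Token]], pred: List[List[Token]]
) -> Tuple[float, float]:
    """Return (UAS, LAS), assuming gold and pred are token-aligned."""
    total = uas = las = 0
    for g_sent, p_sent in zip(gold, pred):
        for (_, g_head, g_rel), (_, p_head, p_rel) in zip(g_sent, p_sent):
            total += 1
            if g_head == p_head:
                uas += 1  # correct head
                if g_rel == p_rel:
                    las += 1  # correct head and correct relation label
    return uas / total, las / total
```

Computing `oov_rate` once with a formal development set and once with the informal development set, against the same formal training data, gives a rough measure of how many more unknown tokens a parser faces out of domain. Restricting `attachment_scores` to a single relation label would show which dependency relations degrade the most.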