The automatic classification of Arabic dialects remains an open research challenge. Recent work has explored it by defining dialects over increasingly fine-grained geographic areas, such as cities and provinces. This paper focuses on a related yet relatively unexplored topic: how the geographical proximity of cities in Arab countries affects their dialectal similarity. Our approach has two parts: 1) comparing the textual similarity between dialects using cosine similarity and 2) measuring the geographical distance between locations. We study MADAR and NADI, two established datasets that cover Arabic dialects from many cities and provinces. Our results indicate that, depending on their geographical proximity, cities in different countries may in fact be more dialectally similar than cities within the same country. The correlation between dialectal similarity and city proximity suggests that cities that are closer together are more likely to share dialectal attributes, regardless of country borders. This finding matters for Arabic dialect research because it indicates that a more granular approach to dialect classification is essential to framing the problem of Arabic dialect identification.
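The two measurements described above, and the correlation between them, can be illustrated with a minimal sketch. The abstract does not say how texts are vectorized, which distance formula is used, or which correlation coefficient is reported, so everything here is an assumption: whitespace-tokenized bag-of-words counts for cosine similarity, the haversine great-circle formula for distance, and Pearson's r for the correlation. The function names and the input layout are hypothetical and are not taken from the paper, MADAR, or NADI.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(text_a, text_b):
    """Cosine similarity of bag-of-words count vectors (assumed representation)."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_a * norm_b)


def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


def similarity_vs_distance(cities):
    """Correlate pairwise text similarity with pairwise distance.

    `cities` maps a city name to (text, (lat, lon)); this layout is hypothetical.
    """
    sims, dists = [], []
    for (_, (text_a, pos_a)), (_, (text_b, pos_b)) in combinations(cities.items(), 2):
        sims.append(cosine_similarity(text_a, text_b))
        dists.append(haversine_km(pos_a, pos_b))
    return pearson(sims, dists)
```

Under these assumptions, a negative coefficient from `similarity_vs_distance` would mean that similarity falls as distance grows, which is the direction of the relationship the abstract reports. Because city pairs are compared without reference to country, the same computation applies to pairs within a country and across borders.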