Building NLP systems that serve everyone requires accounting for dialect differences. But dialects are not monolithic entities: rather, distinctions between and within dialects are captured by the presence, absence, and frequency of dozens of dialect features in speech and text, such as the deletion of the copula in "He {} running". In this paper, we introduce the task of dialect feature detection, and present two multitask learning approaches, both based on pretrained transformers. For most dialects, large-scale annotated corpora for these features are unavailable, making it difficult to train recognizers. We train our models on a small number of minimal pairs, building on how linguists typically define dialect features. Evaluation on a test set of 22 dialect features of Indian English demonstrates that these models learn to recognize many features with high accuracy, and that a few minimal pairs can be as effective for training as thousands of labeled examples. We also demonstrate the downstream applicability of dialect feature detection both as a measure of dialect density and as a dialect classifier.
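The abstract only names the modeling setup at a high level. As a rough illustration of how multitask dialect feature detection from minimal pairs might look, here is a minimal sketch (not the authors' released code): a shared pretrained encoder with one binary head per dialect feature, trained on one minimal pair per feature. The encoder choice (`bert-base-uncased`), the feature list, the masking scheme, and all example sentences except the copula-deletion pair from the abstract are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Hypothetical feature inventory and minimal pairs; only copula deletion
# ("He {} running") comes from the abstract, the rest is illustrative.
FEATURES = ["copula_deletion", "focus_only"]
MINIMAL_PAIRS = {
    # (sentence exhibiting the feature, matched sentence without it)
    "copula_deletion": ("He running.", "He is running."),
    "focus_only": ("She was there yesterday only.", "She was there just yesterday."),
}

class DialectFeatureDetector(nn.Module):
    """Shared pretrained encoder with one binary head per feature (multitask)."""

    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.heads = nn.Linear(self.encoder.config.hidden_size, len(FEATURES))

    def forward(self, **inputs):
        cls = self.encoder(**inputs).last_hidden_state[:, 0]  # [CLS] embedding
        return self.heads(cls)  # one logit per dialect feature

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = DialectFeatureDetector()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
bce = nn.BCEWithLogitsLoss(reduction="none")

# Each side of a minimal pair labels exactly one feature (present=1, absent=0);
# a mask keeps the other features' heads out of the loss for that example.
texts, targets, mask = [], [], []
for idx, feature in enumerate(FEATURES):
    positive, negative = MINIMAL_PAIRS[feature]
    for text, label in ((positive, 1.0), (negative, 0.0)):
        row = [0.0] * len(FEATURES)
        known = [0.0] * len(FEATURES)
        row[idx], known[idx] = label, 1.0
        texts.append(text)
        targets.append(row)
        mask.append(known)

batch = tokenizer(texts, padding=True, return_tensors="pt")
targets = torch.tensor(targets)
mask = torch.tensor(mask)

model.train()
for _ in range(10):  # a handful of steps, just to show the training loop
    optimizer.zero_grad()
    loss = (bce(model(**batch), targets) * mask).sum() / mask.sum()
    loss.backward()
    optimizer.step()
```

At inference time, `torch.sigmoid(model(**batch))` yields a per-feature presence probability for each sentence; averaging those probabilities over a speaker's utterances gives one plausible realization of the dialect-density measure the abstract mentions.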