Corpus-Guided Contrast Sets for Morphosyntactic Feature Detection in Low-Resource English Varieties

The study of language variation examines how language varies between and within different groups of speakers, shedding light on how we use language to construct identities and how social contexts affect language use. A common method is to identify instances of a particular linguistic feature, such as the zero copula construction, in a corpus, and then analyze the feature's distribution across speakers, topics, and other variables, either to gain a qualitative understanding of the feature's function or to measure variation systematically. In this paper, we explore the challenging task of automatic morphosyntactic feature detection in low-resource English varieties. We present a human-in-the-loop approach to generate and filter effective contrast sets via corpus-guided edits. We show that our approach improves feature detection for both Indian English and African American English, demonstrate how it can assist linguistic research, and release our fine-tuned models for use by other researchers.
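To make the "common method" above concrete, here is a minimal Python sketch of two ideas from the abstract: measuring how often a feature occurs per speaker, and pairing utterances that differ by a single minimal edit that flips the feature label, which is the general idea behind a contrast set. It is an illustration under stated assumptions, not the paper's pipeline. The toy utterances, the hand-assigned labels, and the helper names `feature_rate_by_speaker`, `is_one_token_deletion`, and `contrast_pairs` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: (speaker_id, utterance, has_zero_copula).
# Labels are hand-assigned for illustration; they are not from the paper's corpora.
LABELED_UTTERANCES = [
    ("spk_a", "she at home", True),
    ("spk_a", "she is at home", False),
    ("spk_a", "they running late", True),
    ("spk_b", "he is a teacher", False),
    ("spk_b", "he a teacher", True),
    ("spk_b", "we are ready", False),
    ("spk_b", "the food is cold", False),
]


def feature_rate_by_speaker(rows):
    """Return the fraction of each speaker's utterances that carry the feature."""
    counts = defaultdict(lambda: [0, 0])  # speaker -> [feature_count, total]
    for speaker, _text, has_feature in rows:
        counts[speaker][0] += int(has_feature)
        counts[speaker][1] += 1
    return {spk: hits / total for spk, (hits, total) in counts.items()}


def is_one_token_deletion(longer, shorter):
    """True if deleting exactly one whitespace token from `longer` gives `shorter`."""
    a, b = longer.split(), shorter.split()
    if len(a) != len(b) + 1:
        return False
    return any(a[:i] + a[i + 1:] == b for i in range(len(a)))


def contrast_pairs(rows):
    """Pair each feature-positive utterance with a feature-negative utterance
    that differs from it by one inserted token (a minimal, label-flipping edit)."""
    positives = [text for _spk, text, label in rows if label]
    negatives = [text for _spk, text, label in rows if not label]
    return [
        (pos, neg)
        for pos in positives
        for neg in negatives
        if is_one_token_deletion(neg, pos)
    ]


if __name__ == "__main__":
    print(feature_rate_by_speaker(LABELED_UTTERANCES))
    for pos, neg in contrast_pairs(LABELED_UTTERANCES):
        print(f"feature present: {pos!r}  |  feature absent: {neg!r}")
```

In this sketch the pairs are found by simple token comparison; in a human-in-the-loop setting as described in the abstract, a person would propose or vet such edits, which the sketch does not model.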
