Conducting experiments with diverse participants in their native languages can uncover insights into culture, cognition, and language that may not be revealed otherwise. However, conducting these experiments online makes it difficult to validate self-reported language proficiency. Furthermore, existing proficiency tests are small and cover only a few languages. We present an automated pipeline to generate vocabulary tests using text from Wikipedia. Our pipeline samples rare nouns and creates pseudowords with the same low-level statistics. Six behavioral experiments (N=236) in six countries and eight languages show that (a) our test can distinguish between native speakers of closely related languages, (b) the test is reliable ($r=0.82$), and (c) performance strongly correlates with existing tests (LexTale) and self-reports. We further show that test accuracy is negatively correlated with the linguistic distance between the tested and the native language. Our test, available in eight languages, can easily be extended to other languages.
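Below is a minimal sketch of the two pipeline steps named above (sampling rare nouns, then generating pseudowords that match low-level statistics). It is illustrative only: the toy frequency table, the character-bigram Markov model, and the function names (`sample_rare_words`, `train_bigram_model`, `make_pseudoword`) are assumptions for exposition, not the authors' exact implementation, which derives its word counts from Wikipedia text.

```python
import random
from collections import defaultdict


def sample_rare_words(freq, n=20, max_count=5, seed=0):
    """Pick n words whose corpus frequency is at most max_count."""
    rare = [w for w, c in freq.items() if c <= max_count]
    rng = random.Random(seed)
    return rng.sample(rare, min(n, len(rare)))


def train_bigram_model(words):
    """Character-bigram transition lists, with '^' and '$' as word boundaries."""
    trans = defaultdict(list)
    for w in words:
        chars = ['^'] + list(w) + ['$']
        for a, b in zip(chars, chars[1:]):
            trans[a].append(b)
    return trans


def make_pseudoword(trans, length, rng, max_tries=1000):
    """Sample a letter string of (roughly) the target length from the bigram model."""
    best = None
    for _ in range(max_tries):
        out, prev = [], '^'
        while len(out) < length + 3:
            nxt = rng.choice(trans[prev])
            if nxt == '$':
                break
            out.append(nxt)
            prev = nxt
        if len(out) == length:
            return ''.join(out)
        if best is None or abs(len(out) - length) < abs(len(best) - length):
            best = out
    return ''.join(best)


if __name__ == '__main__':
    # Toy word-frequency table standing in for Wikipedia counts (assumption).
    freq = {'table': 900, 'idea': 700, 'quince': 3, 'obelisk': 2, 'marmot': 4}
    rng = random.Random(0)
    targets = sample_rare_words(freq, n=3)
    model = train_bigram_model(freq.keys())
    fillers = [make_pseudoword(model, len(w), rng) for w in targets]
    print('real words:', targets)
    print('pseudowords:', fillers)
```

In this sketch, matching each pseudoword's length and character-bigram distribution to the sampled rare nouns is one plausible reading of "the same low-level statistics"; a test item set would then mix the real words and pseudowords for a yes/no lexical decision task.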