Automated Scoring (AS), the natural language processing task of scoring essays and speeches in an educational testing setting, is growing in popularity and is being deployed across contexts ranging from government examinations to companies providing language proficiency services. However, existing systems either forgo human raters entirely, harming the reliability of the test, or have every response scored by both a human and a machine, increasing costs. We target the spectrum of possible solutions in between, using both humans and machines to provide a higher-quality test while keeping costs low enough to democratize access to AS. In this work, we propose a combination of the existing paradigms that intelligently samples which responses are scored by humans. We propose reward sampling and observe significant gains in accuracy (19.80% increase on average) and quadratic weighted kappa (QWK) (25.60% increase on average) with a relatively small human budget (30% of samples). By comparison, the accuracy increases observed with the standard random and importance sampling baselines are 8.6% and 12.2%, respectively. Furthermore, we demonstrate the system's model-agnostic nature by measuring its performance on a variety of models currently deployed in AS settings as well as on pseudo-models. Finally, we propose an algorithm to estimate the accuracy/QWK with statistical guarantees (Our code is available at https://git.io/J1IOy).
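To make the hybrid human-machine pipeline concrete, the sketch below shows the two ingredients the abstract refers to: routing a budgeted fraction of responses to human raters and evaluating with QWK. The abstract does not specify the reward sampling rule, so a simple least-confidence heuristic stands in as the selection criterion here; the function names and the confidence heuristic are illustrative assumptions, not the paper's method.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, n_classes):
    """Quadratic weighted kappa between two integer rating vectors."""
    O = np.zeros((n_classes, n_classes))          # observed agreement matrix
    for i, j in zip(a, b):
        O[i, j] += 1
    # Expected matrix under rater independence, scaled to the same total as O.
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    idx = np.arange(n_classes)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    return 1.0 - (W * O).sum() / (W * E).sum()

def hybrid_scores(machine_probs, human_scores, budget=0.3):
    """Route the least-confident machine predictions to human raters.

    machine_probs: (n, n_classes) class probabilities from the AS model.
    human_scores:  (n,) human ratings, consulted only for sampled items.
    budget:        fraction of responses the human raters may score.
    """
    machine_pred = machine_probs.argmax(axis=1)
    confidence = machine_probs.max(axis=1)
    k = int(budget * len(machine_pred))
    sampled = np.argsort(confidence)[:k]          # k lowest-confidence items
    final = machine_pred.copy()
    final[sampled] = human_scores[sampled]
    return final
```

With `final = hybrid_scores(probs, human, budget=0.3)`, `quadratic_weighted_kappa(final, gold, n_classes)` can then be compared against the machine-only predictions to measure the gain from the human budget.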