Active learning (AL) employs a data selection algorithm to choose useful training samples, minimizing annotation cost. It is now an essential tool for building low-resource syntactic analyzers such as part-of-speech (POS) taggers. Existing AL heuristics are generally designed on the principle of selecting uncertain yet representative training instances, on the assumption that annotating these instances will eliminate a large number of errors. However, in an empirical study across six typologically diverse languages (German, Swedish, Galician, North Sami, Persian, and Ukrainian), we find the surprising result that even in an oracle scenario where the true uncertainty of predictions is known, these current heuristics are far from optimal. Based on this analysis, we pose the problem of AL as selecting instances that maximally reduce the confusion between particular pairs of output tags. Extensive experimentation on the aforementioned languages shows that our proposed AL strategy outperforms other AL strategies by a significant margin. We also present auxiliary results demonstrating the importance of proper model calibration, which we ensure through cross-view training, and analysis demonstrating how our proposed strategy selects examples that more closely follow the oracle data distribution.
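To make the baseline concrete, below is a minimal sketch of margin-based uncertainty sampling, the standard "select uncertain instances" heuristic family that the analysis above critiques. The function names and toy probability distributions are illustrative assumptions, not the paper's actual implementation: a sentence is scored by its least-confident token, where confidence is the margin between the top two tag probabilities.

```python
def margin_uncertainty(tag_probs):
    """Margin between the top-2 tag probabilities for one token.

    tag_probs: list of per-tag probabilities summing to 1.
    Smaller margin = more uncertain prediction.
    """
    top2 = sorted(tag_probs, reverse=True)[:2]
    return top2[0] - top2[1]


def select_sentences(batch, k):
    """Pick the k sentences whose least-confident token has the
    smallest margin (a common sentence-level aggregation).

    batch: list of sentences; each sentence is a list of per-token
           tag probability distributions.
    Returns the indices of the k most uncertain sentences.
    """
    scored = sorted(
        (min(margin_uncertainty(tok) for tok in sent), i)
        for i, sent in enumerate(batch)
    )
    return [i for _, i in scored[:k]]


# Toy example: sentence 0 is confidently tagged, sentence 1 is not.
batch = [
    [[0.90, 0.05, 0.05]],  # margin 0.85 -> confident
    [[0.40, 0.35, 0.25]],  # margin 0.05 -> uncertain
]
print(select_sentences(batch, 1))  # -> [1]
```

The confusion-pair strategy proposed above differs in that it targets instances resolving specific tag-pair confusions rather than raw uncertainty alone; this sketch only shows the baseline it improves upon.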