In recent years, active learning has been successfully applied to an array of NLP tasks. However, prior work often assumes that training and test data are drawn from the same distribution. This is problematic, as in real-life settings data may stem from several sources of varying relevance and quality. We show that four popular active learning schemes fail to outperform random selection on the task of natural language inference when applied to unlabelled pools composed of multiple data sources. We reveal that uncertainty-based strategies perform poorly due to the acquisition of collective outliers, i.e., hard-to-learn instances that hamper learning and generalization. When outliers are removed, strategies are found to recover and outperform random baselines. In further analysis, we find that collective outliers vary in form between sources, and show that hard-to-learn data is not always categorically harmful. Lastly, we leverage dataset cartography to introduce difficulty-stratified testing and find that different strategies are affected differently by example learnability and difficulty.