The ability to train complex and highly effective models often requires an abundance of training data, which can easily become a bottleneck in cost, time, and computational resources. Batch active learning, which adaptively issues batched queries to a labeling oracle, is a common approach for addressing this problem. The practical benefits of batch sampling come with the downside of less adaptivity and the risk of sampling redundant examples within a batch -- a risk that grows with the batch size. In this work, we analyze an efficient active learning algorithm designed for the large-batch setting. In particular, we show that our sampling method, which combines notions of uncertainty and diversity, easily scales to batch sizes (100K-1M) several orders of magnitude larger than those used in previous studies, and provides significant improvements in model training efficiency compared to recent baselines. Finally, we provide an initial theoretical analysis, proving label complexity guarantees for a related sampling method, which we show is approximately equivalent to our method in specific settings.
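To make the uncertainty-plus-diversity idea concrete, below is a minimal Python sketch of one plausible instantiation: margin-based uncertainty filtering to over-sample a candidate set, followed by agglomerative clustering over embeddings and a cluster round-robin to fill the batch. The function name `select_batch`, the `candidate_factor` over-sampling parameter, and the per-call clustering are illustrative assumptions, not the exact procedure analyzed in the paper.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def select_batch(probs, embeddings, batch_size, candidate_factor=10):
    """Illustrative uncertainty + diversity batch selection (hypothetical sketch).

    probs:      (N, C) predicted class probabilities on the unlabeled pool
    embeddings: (N, D) feature embeddings used to measure diversity
    """
    # Uncertainty: margin between the top two predicted class probabilities;
    # small margins mark the examples the model is least sure about.
    top2 = np.partition(probs, -2, axis=1)
    margins = top2[:, -1] - top2[:, -2]

    # Over-sample: keep the candidate_factor * batch_size most uncertain examples.
    m = min(candidate_factor * batch_size, len(margins))
    candidates = np.argsort(margins)[:m]
    batch_size = min(batch_size, m)

    # Diversity: cluster the candidates' embeddings so the batch is spread
    # across distinct regions of the input space rather than one uncertain blob.
    labels = AgglomerativeClustering(n_clusters=batch_size).fit_predict(
        embeddings[candidates]
    )

    # Round-robin over clusters (smallest first), always taking the most
    # uncertain remaining example in each cluster, until the batch is full.
    clusters = [sorted(candidates[labels == c], key=lambda i: margins[i])
                for c in range(batch_size)]
    clusters.sort(key=len)
    selected = []
    while len(selected) < batch_size:
        for cluster in clusters:
            if cluster and len(selected) < batch_size:
                selected.append(cluster.pop(0))
    return np.array(selected)
```

At the 100K-1M batch sizes reported above, one would presumably cluster the pool once up front and reuse the clustering across rounds rather than re-clustering per call as this sketch does; the sketch trades that efficiency for brevity.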