Important tasks like record linkage and extreme classification exhibit extreme class imbalance, with 1 minority instance for every 1 million or more majority instances. Obtaining a sufficient sample of all classes, even just to achieve statistically significant evaluation, is so challenging that most current approaches yield poor estimates or incur impractical labeling cost. Where importance sampling has been levied against this challenge, restrictive constraints are placed on performance metrics, estimates do not come with appropriate guarantees, or evaluations cannot adapt to incoming labels. This paper develops a framework for online evaluation based on adaptive importance sampling. Given a target performance metric and a model for $p(y|x)$, the framework adapts a distribution over items to label in order to maximize statistical precision. We establish strong consistency and a central limit theorem for the resulting performance estimates, and instantiate our framework with worked examples that leverage Dirichlet-tree models. Experiments demonstrate average MSE superior to the state of the art under fixed label budgets.
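To make the setup concrete, the following is a minimal sketch, not the authors' implementation, of adaptive importance sampling for evaluating one such metric (precision) on a simulated pool. A flat Beta posterior per score stratum stands in for the paper's Dirichlet-tree model, and the proposal heuristic (sampling items in proportion to the posterior positive rate) is an illustrative assumption rather than the paper's variance-optimal choice; `oracle`, `n_strata`, and all other names are hypothetical.

```python
# Hedged sketch: adaptive importance sampling for precision estimation
# under class imbalance. Illustrative only; not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)

# --- Simulated evaluation pool (stand-in for real data and a labeling oracle) ---
N = 50_000
scores = rng.beta(0.1, 10.0, size=N)            # model scores approximating p(y=1|x)
y_true = (rng.random(N) < scores).astype(int)   # hidden labels; roughly 1% positives
y_pred = (scores >= np.quantile(scores, 0.99)).astype(int)  # classifier under evaluation

def oracle(i):
    """Query the (costly) true label of item i."""
    return y_true[i]

# --- Stratify items by score: a flat stand-in for a Dirichlet-tree prior ---
n_strata = 10
strata = np.minimum((scores * n_strata).astype(int), n_strata - 1)
alpha = np.ones(n_strata)   # Beta(1,1) prior on p(y=1) within each stratum
beta = np.ones(n_strata)

T = 1_000                    # label budget
num_w = np.zeros(T)          # weighted draws of y_i * yhat_i (true positives)
den_w = np.zeros(T)          # weighted draws of yhat_i (predicted positives)

for t in range(T):
    # Proposal: bias sampling toward strata whose posterior positive rate is
    # high; the epsilon keeps the proposal supported on the whole pool.
    p_pos = alpha / (alpha + beta)
    q_item = p_pos[strata] + 1e-3
    q_item /= q_item.sum()

    i = rng.choice(N, p=q_item)
    y = oracle(i)
    w = 1.0 / (N * q_item[i])   # importance weight relative to uniform sampling

    num_w[t] = w * y * y_pred[i]
    den_w[t] = w * y_pred[i]

    # Adapt: update the Beta posterior of i's stratum with the new label.
    alpha[strata[i]] += y
    beta[strata[i]] += 1 - y

# Ratio estimate of precision = E[y * yhat] / E[yhat].
precision_est = num_w.mean() / den_w.mean()
precision_true = (y_true * y_pred).sum() / y_pred.sum()
print(f"estimated precision: {precision_est:.3f}  (true: {precision_true:.3f})")
```

Each weighted draw remains conditionally unbiased given the proposal adapted so far, which is the property that makes consistency and central limit results of the kind claimed in the abstract attainable for such adaptive schemes.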