Labeled data is essential in modern systems that rely on Machine Learning (ML) for predictive modelling. Such systems may suffer from the cold-start problem: supervised models work well, but initially there are no labels, and labels are costly or slow to obtain. The problem is even worse in imbalanced data scenarios. Online financial fraud detection is an example where labeling is either i) expensive, or ii) subject to long delays, if it relies on victims filing complaints. Waiting for complaints may not be viable if a model has to be in place immediately, so an option is to ask analysts to label events while minimizing the number of annotations to control costs. We propose an Active Learning (AL) annotation system for datasets with orders of magnitude of class imbalance, in a cold-start streaming scenario. We present a computationally efficient Outlier-based Discriminative AL approach (ODAL) and design a novel 3-stage sequence of AL labeling policies in which ODAL is used as a warm-up. We then perform empirical studies on four real-world datasets with various magnitudes of class imbalance. The results show that our method reaches a high-performance model more quickly than standard AL policies. Its observed gains over random sampling reach up to 80%, and it is competitive with policies that have an unlimited annotation budget or additional historical data, while using 1/10 to 1/50 of the labels.