Active learning for online training in imbalanced data streams under cold start

Labeled data is essential in modern systems that rely on Machine Learning (ML) for predictive modelling. Such systems may suffer from the cold-start problem: supervised models work well, but initially there are no labels, and labels are costly or slow to obtain. This problem is even worse in imbalanced data scenarios. Online financial fraud detection is an example where labeling is either i) expensive, or ii) subject to long delays when it relies on victims filing complaints. Waiting for complaints may not be viable if a model has to be in place immediately, so an option is to ask analysts to label events while minimizing the number of annotations to control costs. We propose an Active Learning (AL) annotation system for datasets with orders of magnitude of class imbalance, in a cold-start streaming scenario. We present a computationally efficient Outlier-based Discriminative AL approach (ODAL) and design a novel 3-stage sequence of AL labeling policies in which ODAL is used as warm-up. We then perform empirical studies on four real-world datasets with various magnitudes of class imbalance. The results show that our method reaches a high-performance model more quickly than standard AL policies. Its observed gains over random sampling can reach 80%, and, with 1/10 to 1/50 of the labels, it can be competitive with policies that have an unlimited annotation budget or additional historical data.
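The abstract does not spell out the three stages or how ODAL scores events, so the following is only a minimal sketch of the general idea of switching labeling policies as labels accumulate. It assumes the stages are random sampling, then an outlier-based warm-up, then uncertainty sampling. It also swaps in nearest-labeled-neighbor distance as a stand-in for a real outlier detector. The names `outlier_scores`, `choose_policy`, `select_index` and the thresholds `n_random` and `min_positives` are hypothetical and do not come from the paper.

```python
import math
import random
from typing import Callable, List, Sequence

Point = Sequence[float]


def outlier_scores(labeled: List[Point], unlabeled: List[Point]) -> List[float]:
    """Distance from each unlabeled point to its nearest labeled point.

    Stand-in for an outlier detector fitted on the labeled pool (assumption):
    a larger score means the point looks less like anything labeled so far.
    """
    return [min(math.dist(u, l) for l in labeled) for u in unlabeled]


def choose_policy(n_labeled: int, n_positives: int,
                  n_random: int = 5, min_positives: int = 2) -> str:
    """Pick the labeling policy for the current stage (hypothetical thresholds)."""
    if n_labeled < n_random:
        return "random"        # stage 1: nothing labeled yet, sample blindly
    if n_positives < min_positives:
        return "outlier"       # stage 2: warm-up until a few positives are found
    return "uncertainty"       # stage 3: a supervised model can now be trained


def select_index(policy: str, labeled: List[Point], unlabeled: List[Point],
                 predict_proba: Callable[[Point], float],
                 rng: random.Random) -> int:
    """Return the index of the unlabeled event to send to an analyst."""
    indices = range(len(unlabeled))
    if policy == "random":
        return rng.randrange(len(unlabeled))
    if policy == "outlier":
        scores = outlier_scores(labeled, unlabeled)
        return max(indices, key=scores.__getitem__)
    # Uncertainty sampling: pick the event whose predicted probability
    # is closest to the 0.5 decision boundary.
    return min(indices, key=lambda i: abs(predict_proba(unlabeled[i]) - 0.5))
```

In a streaming loop, one would call `choose_policy` with the current label counts, pass the result to `select_index`, ask an analyst for that event's label, and retrain once the third stage is reached. The warm-up stage matters under heavy class imbalance because random sampling alone may take a long time to surface any positive example.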


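Related content

Active learning

Active learning is a subfield of machine learning (and, more broadly, of artificial intelligence); in statistics it is also called query learning or optimal experimental design. The "learning module" and the "selection strategy" are the two basic and essential components of an active learning algorithm. The term is also used in education, where active learning is "a method of learning in which students are actively or experientially involved in the learning process, and where there are different levels of active learning depending on student involvement" (Bonwell & Eison 1991). Bonwell & Eison (1991) state that "students are doing something besides passively listening." In a report for the Association for the Study of Higher Education (ASHE), the authors discuss a variety of methods for promoting active learning. They cite literature indicating that students must do more than just listen in order to learn: they must read, write, discuss, and engage in solving problems. This process relates to three learning domains: knowledge, skills, and attitudes (KSA). This taxonomy of learning behaviors can be thought of as "the goals of the learning process". In particular, students must engage in higher-order thinking tasks such as analysis, synthesis, and evaluation.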
