When a machine learning classifier is trained on data in which one class is intrinsically rare, it will often assign too few sources to that class. A common remedy is to up-weight the rare-class examples so that the class is not ignored. For the same reason, it is also common to train on a restricted data set in which the balance of source types is closer to equal. Here we show that these practices can bias the model toward over-assigning sources to the rare class. We also explore how to detect when training-data bias has had a statistically significant impact on the trained model's predictions, and how to reduce that impact. The magnitude of the effect of the techniques developed here will vary with the details of the application, but in most cases it should be modest. The techniques are, however, applicable whenever a machine learning classification model is used, which makes them analogous to Bessel's correction to the sample variance.
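To make the over-assignment effect concrete, the sketch below applies the textbook prior-shift adjustment to a predicted probability. It is an illustration under stated assumptions, not necessarily the technique developed in this work: it assumes the classifier's outputs are calibrated with respect to the (re-balanced) training data, and the function name `correct_for_prior_shift` and the prior values are hypothetical.

```python
def correct_for_prior_shift(p_rare_train, train_prior, true_prior):
    """Rescale a rare-class probability from the training prior to the true prior.

    p_rare_train: rare-class probability from a model trained on re-balanced data
    train_prior:  effective rare-class fraction in training (after re-weighting)
    true_prior:   assumed rare-class fraction in the population being classified
    """
    # Ratio of true to training prior for each class
    w_rare = true_prior / train_prior
    w_common = (1.0 - true_prior) / (1.0 - train_prior)

    # Re-weight each class's probability, then renormalize so they sum to one
    numerator = p_rare_train * w_rare
    denominator = numerator + (1.0 - p_rare_train) * w_common
    return numerator / denominator


# Hypothetical numbers: a model trained on balanced data (50% rare),
# applied to a population in which the rare class is only 1% of sources.
adjusted = correct_for_prior_shift(0.9, train_prior=0.5, true_prior=0.01)
print(f"balanced-model score 0.90 -> adjusted probability {adjusted:.3f}")
```

In this hypothetical case, a source scored at 0.9 by the balanced model has an adjusted rare-class probability below 0.1, so thresholding the unadjusted score at 0.5 would assign it to the rare class even though it is more likely to be common.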