Are there methods for handling imbalanced data? | Community Q&A

December 24, 2017 · AI研习社

This is AI研习社. Our community has officially launched, and everyone is welcome to join the discussion:

mooc.ai/bbs

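Here is a featured thread picked by the editor from the AI研习社 Q&A community. If you have a question of your own, you are welcome to ask it there.

A short introduction:

The community currently focuses on Q&A and blogs. It supports text, images, video, code, formulas, and hyperlinks, so you can describe a problem, answer a question, or write an article as clearly as possible. If anything else is missing, tell me and I will add it.

Enough talk; on to the question.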

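Q: Are there methods for handling imbalanced data?

Are there any good methods for handling imbalanced data? And beyond that, what are some good data augmentation methods?

Answers from community members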

▼▼▼

@布莱克 • 丹尼

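Quoting fast.ai Part 2, Lesson 12, at around minute 110 (http://t.cn/RHA5UJf): an answer about handling imbalanced data in a cancer detection model, where cancer cases typically make up only 0.3% of the data. The speaker is Jeremy, who founded a company dedicated to cancer detection. My summary: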

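1. Why does the model turn out badly if you do nothing?

When cancer cases are very rare, the model tends to learn that no sample in the dataset has cancer, because always predicting "no cancer" already gives very high accuracy. The weights learned for the cancer samples end up very small, or even 0.

2. How should the model be built?

One option is to start from an initial model and train it on specially constructed mini-batches with a 1:1 ratio of positive to negative samples, repeated over many epochs, so that the model learns weights for the cancer samples. Because the same cancer samples are seen many times, this needs techniques that guard against overfitting.

Detecting cancer accurately without overfitting means working between two extremes: leaving the data as it is and forcing a 1:1 balance. First figure out "what's the smallest number of people with cancer that you can get away with". Say it is 10%: build a model whose mini-batches have a 9:1 ratio of non-cancer to cancer samples and train it. Once that model works well, reduce the share of cancer samples and train again, and so on. "I think that's the basic technique," he says.

3. In this example, does repeatedly training on the cancer samples simply amount to giving them more weight?

Yes.

4. Would training on fewer non-cancer samples give the same result?

Yes, and it is the quick way to do it, but then you are not using all the information: the non-cancer samples you throw away still carry information.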
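The mini-batch schedule described in point 2 can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not code from the lecture: the names `make_batch` and `fraction_schedule`, the batch size of 100, and the 10% to 2% schedule are all made up for the example.

```python
import random

def make_batch(pos_idx, neg_idx, batch_size, pos_fraction, rng):
    """Draw one mini-batch of sample indices with a fixed share of positives."""
    n_pos = max(1, round(batch_size * pos_fraction))
    n_neg = batch_size - n_pos
    # Positives are rare, so draw them with replacement (they will repeat).
    pos = rng.choices(pos_idx, k=n_pos)
    # Negatives are plentiful, so draw them without replacement within a batch.
    neg = rng.sample(neg_idx, k=n_neg)
    batch = pos + neg
    rng.shuffle(batch)
    return batch

def fraction_schedule(start, end, n_stages):
    """Move the positive fraction linearly from `start` to `end` (hypothetical schedule)."""
    if n_stages == 1:
        return [start]
    step = (end - start) / (n_stages - 1)
    return [start + i * step for i in range(n_stages)]

# Toy labels: 3 positives in 1000 samples (0.3%).
labels = [1 if i < 3 else 0 for i in range(1000)]
pos_idx = [i for i, y in enumerate(labels) if y == 1]
neg_idx = [i for i, y in enumerate(labels) if y == 0]

rng = random.Random(0)
for frac in fraction_schedule(0.10, 0.02, n_stages=3):
    batch = make_batch(pos_idx, neg_idx, batch_size=100, pos_fraction=frac, rng=rng)
    n_pos = sum(labels[i] for i in batch)
    print(f"target positive fraction {frac:.2f}: {n_pos} of {len(batch)} are positive")
```

In a real training loop each stage would run for many batches, and the move to the next, lower fraction would happen only once the model no longer collapses to predicting "no cancer" everywhere.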

The original transcript is pasted below. My summary may be imperfect, so please check it against the original:

Question: Any advice on imbalanced datasets? Seems to be a recurring issue with real world data.

Answer: Unbalanced datasets, yes. There's not really that much clever you can do about it. A great example would be, one of the impact talks talked about breast cancer detection from mammography scans and this thing called the Dream Challenge had 0.3% of the scans actually had cancer. So that's very unbalanced.

The first thing you try to do with such an unbalanced data set is ignore it and try it and see how it goes. The reason it often doesn't go well is that the initial gradients will tend to point to say they never have cancer because that's going to give you a very accurate model.

One thing you can try and do is to come up with an initial model, which is like maybe some kind of heuristic which is not terrible and get to the point where the gradients don't always point to saying they never have cancer.

But the really obvious thing to do is to adjust your thing which is creating the mini-batches so that on every mini-batch you grab like half of it as being people with cancer and half of it people without cancer. That way you can still go through lots and lots of epochs. The challenge is the people that do have cancer, you're going to see lots and lots of times, so you have to be careful of overfitting.

Then basically there's things between those two extremes. So I think what you really need to do is figure out what's the smallest number of people with cancer that you can get away with. What's the smallest number where the gradients don't point to 0. Let's say it's 10%. So create a model where every mini-batch you create 10% of it with people with cancer and 90% people without. Train that for a while. The good news is once it's working pretty well, you can then decrease the has cancer size because you're already at the point where your model's not pointing off to 0. So you can kind of gradually start to change the sample to have less and less. I think that's the basic technique.

Question: So in this example where you're repeating the positive results over and over, you're essentially just weighting them more?

Answer: Yes.

Question: Could you get the same results by just throwing away a bunch of the false dataset?

Answer: Yes, you could do that and that's the really quick way to do it. But that way you're not using all, like the information about the false stuff still has information.
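The remark that never predicting cancer "is going to give you a very accurate model" can be checked with a little arithmetic. The sketch below uses made-up toy data (3 cancers in 1000 scans, matching the 0.3% rate quoted above) and a hypothetical "model" that always answers no.

```python
# Toy data: 3 cancers in 1000 scans (0.3%).
labels = [1] * 3 + [0] * 997
always_negative = [0] * len(labels)  # a "model" that never predicts cancer

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Share of actual positives that were predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0

print(f"accuracy: {accuracy(labels, always_negative):.3f}")  # 0.997
print(f"recall:   {recall(labels, always_negative):.3f}")    # 0.000
```

Accuracy looks excellent while not a single cancer is found, which is why accuracy alone is a poor guide on data this skewed.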

@MicoonZhang

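Some simple, commonly used approaches:

When data is scarce, oversampling is common: duplicate samples from the class with fewer observations.
When data is plentiful, undersampling is common: remove samples from the class with more observations.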
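Both resampling ideas fit in a short sketch. It assumes a binary problem in which the class passed as `minority` really is the smaller one; the function names and the toy data are invented for illustration.

```python
import random

def split_rows(y, minority):
    """Return (minority row indices, majority row indices)."""
    minority_rows = [i for i, label in enumerate(y) if label == minority]
    majority_rows = [i for i, label in enumerate(y) if label != minority]
    return minority_rows, majority_rows

def random_oversample(X, y, minority, rng):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    minority_rows, majority_rows = split_rows(y, minority)
    extra = rng.choices(minority_rows, k=len(majority_rows) - len(minority_rows))
    keep = majority_rows + minority_rows + extra
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]

def random_undersample(X, y, minority, rng):
    """Drop randomly chosen majority rows until both classes have equal size."""
    minority_rows, majority_rows = split_rows(y, minority)
    keep = minority_rows + rng.sample(majority_rows, k=len(minority_rows))
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: 2 minority rows, 8 majority rows.
X = [[i] for i in range(10)]
y = [1, 1] + [0] * 8
rng = random.Random(0)

X_over, y_over = random_oversample(X, y, minority=1, rng=rng)
X_under, y_under = random_undersample(X, y, minority=1, rng=rng)
print("oversampled class counts: ", y_over.count(0), y_over.count(1))
print("undersampled class counts:", y_under.count(0), y_under.count(1))
```

Resampling should be applied to the training split only; a validation or test set that has been rebalanced no longer reflects the real class ratio.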

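Minority-class samples can also be generated algorithmically:

The classic example is SMOTE (Synthetic Minority Over-sampling Technique). It uses the distance between samples as a similarity measure to find a minority sample's nearest neighbors, then creates a new sample by adding a random perturbation to that sample, where the perturbation is a random fraction of the difference between the sample and one of its neighbors.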
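A simplified sketch of the SMOTE interpolation step follows. It is not a reference implementation: it picks the base sample at random instead of looping over every minority sample, uses brute-force Euclidean distance, and the function name `smote_sketch` and the toy points are made up. In practice a tested library implementation is the safer choice.

```python
import math
import random

def smote_sketch(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of `base`, excluding `base` itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random fraction in [0, 1)
        # New point lies on the line segment between `base` and `neighbour`.
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class: four points at the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=5, k=2, rng=random.Random(0))
for point in new_points:
    print([round(v, 3) for v in point])
```

Because each new point is a convex combination of two existing minority points, it always falls between them, never outside the region the minority class already occupies.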

Other options include weighting and penalized models (for example penalized-SVM or penalized-LDA).
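As one illustration of weighting, the sketch below computes a class-weighted log loss. The weights follow the common "balanced" heuristic n_samples / (n_classes * class_count); the function names, the toy labels, and the constant 1% prediction are assumptions made for the example.

```python
import math

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}

def weighted_log_loss(y_true, p_pred, weights, eps=1e-12):
    """Binary cross-entropy in which each sample is scaled by its class weight."""
    total, weight_sum = 0.0, 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        loss = -math.log(p) if t == 1 else -math.log(1 - p)
        total += weights[t] * loss
        weight_sum += weights[t]
    return total / weight_sum

# Toy data: 3 positives in 1000 samples; the "model" predicts 1% for everyone.
y = [1] * 3 + [0] * 997
p = [0.01] * len(y)

plain = weighted_log_loss(y, p, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(y, p, balanced_class_weights(y))
print(f"unweighted loss: {plain:.4f}")
print(f"weighted loss:   {weighted:.4f}")
```

With the weights applied, the three missed positives contribute as much to the loss as all 997 negatives together, so a model that ignores the minority class is penalized far more heavily.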

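Or you could change perspective: treat a heavily imbalanced problem as anomaly detection, use one-class classification (One-Class-SVM), or consider models such as RandomForest that randomly sample the training set.

This kind of problem also depends heavily on business requirements, and domain knowledge may resolve part of it.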

@mojuan

Common options include oversampling or undersampling and modifying the cost function. For details, see the highly cited survey on imbalanced data by Haibo He.

@JianJuly

See "Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmentations" and the works it cites. It is very detailed.
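For orientation, here is a minimal sketch of a generalised Dice loss in its commonly cited form: each class is weighted by the inverse square of its reference volume, so that small structures count as much as large ones. The list-of-lists layout, the `eps` smoothing term, and the toy data are assumptions made for this example; consult the paper for the exact formulation and its variants.

```python
def generalised_dice_loss(ref, pred, eps=1e-8):
    """ref[l][n]: one-hot reference for class l at voxel n; pred[l][n]: predicted probability."""
    numerator, denominator = 0.0, 0.0
    for r_l, p_l in zip(ref, pred):
        # Class weight: inverse squared reference volume, so rare classes count more.
        w_l = 1.0 / (sum(r_l) ** 2 + eps)
        numerator += w_l * sum(r * p for r, p in zip(r_l, p_l))
        denominator += w_l * sum(r + p for r, p in zip(r_l, p_l))
    return 1.0 - 2.0 * numerator / (denominator + eps)

# Toy segmentation: 10 voxels, 1 foreground and 9 background.
foreground = [1.0] + [0.0] * 9
background = [0.0] + [1.0] * 9
ref = [foreground, background]

perfect = [foreground, background]       # prediction equals the reference
all_background = [[0.0] * 10, [1.0] * 10]  # prediction ignores the foreground

print(f"perfect prediction:    {generalised_dice_loss(ref, perfect):.4f}")
print(f"all-background guess:  {generalised_dice_loss(ref, all_background):.4f}")
```

Note that with this weighting a class that is entirely absent from the reference receives a very large weight (roughly 1 / eps), so real implementations usually handle empty classes explicitly.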