@布莱克 • 丹尼
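Quoting Part 2, Lesson 12, around minute 110: the answer on handling imbalanced data in a cancer-detection model, where cancer typically makes up only 0.3% of the samples. (The speaker is Jeremy, who founded a company that specializes in cancer detection.) My summary:
1. Why does the model perform poorly if you do nothing?
When cancer cases are very rare, the early gradients push the model toward predicting "no cancer" for the whole dataset, because that already gives a very accurate model. The weights learned for the cancer samples end up very small or even 0.
2. How do you build the model?
The obvious fix is to make every mini-batch half cancer and half non-cancer samples, but then the cancer samples are seen many times and the model can overfit. Detecting cancer accurately and avoiding overfitting pull toward two extremes. So first figure out "what's the smallest number of people with cancer that you can get away with". Say it is 10%: build a model whose mini-batches have a non-cancer to cancer ratio of 9:1 and train it. Once that model works well, lower the cancer proportion and train again, and so on. Jeremy calls this the basic technique.
3. In this example, does repeatedly training on the cancer samples amount to giving them more weight?
Yes.
4. Would simply training on fewer non-cancer samples give the same result?
Yes, and it is the quick way to do it, but it throws away information that the non-cancer samples still carry.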
Question: Any advice on imbalanced datasets? Seems to be a recurring issue with real world data.
Answer: Unbalanced datasets, yes. There's not really that much clever you can do about it. A great example would be, one of the impact talks talked about breast cancer detection from mammography scans and this thing called the Dream Challenge had 0.3% of the scans actually had cancer. So that's very unbalanced.
The first thing you try to do with such an unbalanced data set is ignore it and try it and see how it goes. The reason it often doesn't go well is that the initial gradients will tend to point to say they never have cancer because that's going to give you a very accurate model.
One thing you can try and do is to come up with an initial model, which is like maybe some kind of heuristic which is not terrible and get to the point where the gradients don't always point to saying they never have cancer.
But the really obvious thing to do is to adjust your thing which is creating the mini-batches so that on every mini-batch you grab like half of it as being people with cancer and half of it people without cancer. That way you can still go through lots and lots of epochs. The challenge is the people that do have cancer, you're going to see lots and lots of times, so you have to be careful of overfitting.
Then basically there's things between those two extremes. So I think what you really need to do is figure out what's the smallest number of people with cancer that you can get away with. What's the smallest number where the gradients don't point to 0. Let's say it's 10%. So create a model where every mini-batch you create 10% of it with people with cancer and 90% people without. Train that for a while. The good news is once it's working pretty well, you can then decrease the has cancer size because you're already at the point where your model's not pointing off to 0. So you can kind of gradually start to change the sample to have less and less. I think that's the basic technique.
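The mini-batch idea in this answer can be sketched roughly as follows. This is only an illustration, not code from the lecture: the function name `make_batch`, the toy sample names, the batch size of 100, and the schedule of positive fractions are all assumptions made up for the example.

```python
import random

def make_batch(pos, neg, batch_size, pos_fraction, rng):
    """Draw one mini-batch with a fixed share of positive (cancer) samples."""
    n_pos = max(1, round(batch_size * pos_fraction))
    n_neg = batch_size - n_pos
    # Positives are scarce, so draw them with replacement (they repeat often);
    # negatives are plentiful, so draw them without replacement.
    batch = rng.choices(pos, k=n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(batch)
    return batch

# Hypothetical toy data: 3 positives in 1000 samples, matching the 0.3% rate.
pos = [("scan_pos_%d" % i, 1) for i in range(3)]
neg = [("scan_neg_%d" % i, 0) for i in range(997)]

rng = random.Random(0)
# Hypothetical schedule: start at 10% positives, then lower the share in stages.
schedule = [0.10, 0.05, 0.02]
batches = [make_batch(pos, neg, 100, frac, rng) for frac in schedule]
pos_counts = [sum(label for _, label in batch) for batch in batches]
print(pos_counts)  # [10, 5, 2]
```

In real training each stage would run for many batches before the fraction is lowered. Frameworks usually provide their own weighted samplers for this, so a hand-written sampler like the one above is mainly useful for seeing the mechanics.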
Question: So in this example where you're repeating the positive results over and over, you're essentially just weighting them more?
Answer: Yes.
Question: Could you get the same results by just throwing away a bunch of the false dataset?
Answer: Yes, you could do that and that's the really quick way to do it. But that way you're not using all, like the information about the false stuff still has information.
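A small numeric check of the "repeating is weighting" point, assuming a binary cross-entropy loss (the transcript does not say which loss was used) and made-up predictions: repeating the positive sample k times gives the same summed loss as counting it once with weight k.

```python
import math

def log_loss_sum(probs, labels, weights=None):
    """Sum of (optionally weighted) binary cross-entropy terms."""
    if weights is None:
        weights = [1.0] * len(labels)
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total

# Hypothetical predicted probabilities: one positive sample, three negatives.
probs = [0.2, 0.1, 0.3, 0.05]
labels = [1, 0, 0, 0]

k = 5  # how many times the positive sample is repeated
repeated = log_loss_sum([probs[0]] * k + probs[1:], [1] * k + labels[1:])
weighted = log_loss_sum(probs, labels, weights=[k, 1, 1, 1])
print(math.isclose(repeated, weighted))  # True
```

Throwing away negatives changes the ratio in a similar way, but as the answer notes, the discarded samples no longer contribute any information.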
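A classic method is SMOTE (Synthetic Minority Over-sampling Technique). It uses the distance between two or more samples as the similarity measure, then generates a new sample by adding a random perturbation to one of them, where the perturbation lies within the difference between neighboring samples.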
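A minimal sketch of the interpolation step behind SMOTE, in plain Python. It is a simplified illustration rather than a full implementation: the function name `smote`, the toy minority points, and the choice of k = 2 are assumptions, and it needs Python 3.8+ for `math.dist`.

```python
import math
import random

def smote(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating towards near neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbours of base (Euclidean distance).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position between base (0) and target (1)
        synthetic.append(tuple(b + gap * (t - b) for b, t in zip(base, target)))
    return synthetic

# Hypothetical minority-class points in two dimensions.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6, k=2, rng=random.Random(0))
print(len(new_points))  # 6
```

Each synthetic point lies on the segment between a real minority sample and one of its neighbours, which is the "perturbation within the difference between neighboring samples" described above. For real work, the `SMOTE` class in the imbalanced-learn package is likely a better starting point than a hand-rolled version.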
Other options include weighting the samples and using penalized models (for example penalized-SVM or penalized-LDA).
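One way to read "penalized" here, and this is an assumption on my part, is a loss that charges more for mistakes on the rare class. The sketch below computes a class-weighted hinge loss (the loss behind a linear SVM) on made-up scores; `weighted_hinge_loss` and the weight of 3.0 are hypothetical.

```python
def weighted_hinge_loss(scores, labels, class_weight):
    """Mean hinge loss where each sample's penalty is scaled by its class weight."""
    total = 0.0
    for s, y in zip(scores, labels):
        total += class_weight[y] * max(0.0, 1.0 - y * s)
    return total / len(labels)

# Hypothetical decision scores; labels use the SVM convention of -1 / +1.
scores = [0.5, -2.0, -1.5, -0.2]
labels = [1, -1, -1, -1]

plain = weighted_hinge_loss(scores, labels, {1: 1.0, -1: 1.0})
penalised = weighted_hinge_loss(scores, labels, {1: 3.0, -1: 1.0})
print(round(plain, 3), round(penalised, 3))  # 0.325 0.575
```

The positive sample sits inside the margin, so raising its class weight raises the loss and pushes the optimizer to fix it first. In scikit-learn the same idea is exposed through the `class_weight` parameter of `SVC`.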
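Or you could change perspective and treat the heavily imbalanced problem as anomaly detection. Or use one-class classification (One-Class-SVM). Or perhaps consider models such as RandomForest that randomly sample the training set.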
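To make the anomaly-detection framing concrete, here is a deliberately simple stand-in. It is not a One-Class-SVM: it fits per-feature mean and standard deviation on the majority class only and flags points that sit far away. The function names, toy data, and threshold of 3.0 are all assumptions.

```python
import math
import statistics

def fit_normal_profile(normal_points):
    """Per-feature mean and standard deviation of the majority class only."""
    columns = list(zip(*normal_points))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.stdev(c) for c in columns]
    return means, stds

def anomaly_score(point, means, stds):
    """Euclidean norm of the per-feature z-scores; larger means more unusual."""
    return math.sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(point, means, stds)))

# Hypothetical majority-class ("normal") samples with two features.
normal = [(1.0, 2.0), (1.1, 1.9), (0.9, 2.1), (1.0, 2.2), (1.2, 1.8)]
means, stds = fit_normal_profile(normal)

threshold = 3.0  # hypothetical cut-off
candidates = [(1.05, 2.0), (3.0, 0.0)]
flags = [anomaly_score(p, means, stds) > threshold for p in candidates]
print(flags)  # [False, True]
```

The key property shared with one-class methods is that no minority samples are needed for fitting. scikit-learn's `OneClassSVM` follows the same fit-on-normal, score-everything pattern with a much more flexible boundary.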
The usual options are oversampling, undersampling, or modifying the cost function. For details, see the highly cited survey on learning from imbalanced data by Haibo He.
See "Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmentations" and the papers it cites. It is very detailed.
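For readers who want a feel for that loss before opening the paper, here is a small sketch of the generalised Dice loss as I understand it from the paper: each label is weighted by the inverse square of its reference volume. The dictionary-of-lists data layout, the `eps` stabilizer, and the toy 10-pixel example are assumptions for illustration.

```python
def generalised_dice_loss(ref, pred, eps=1e-8):
    """ref and pred map each label to a list of per-pixel values in [0, 1]."""
    numerator = 0.0
    denominator = 0.0
    for label in ref:
        r, p = ref[label], pred[label]
        weight = 1.0 / (sum(r) ** 2 + eps)  # inverse squared label volume
        numerator += weight * sum(ri * pi for ri, pi in zip(r, p))
        denominator += weight * sum(ri + pi for ri, pi in zip(r, p))
    return 1.0 - 2.0 * numerator / (denominator + eps)

# Hypothetical 10-pixel image with a single foreground pixel.
ref = {"fg": [1] + [0] * 9, "bg": [0] + [1] * 9}
perfect = {"fg": [1] + [0] * 9, "bg": [0] + [1] * 9}
all_background = {"fg": [0] * 10, "bg": [1] * 10}

loss_perfect = generalised_dice_loss(ref, perfect)
loss_all_bg = generalised_dice_loss(ref, all_background)
print(round(loss_perfect, 4), round(loss_all_bg, 4))  # 0.0 0.82
```

Predicting background everywhere is 90% accurate pixel-wise, yet the loss is about 0.82, because the single foreground pixel carries a weight 81 times larger than each background pixel's label. That is the behaviour that makes this kind of loss attractive for highly unbalanced segmentation.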