Handling Imbalanced Positive and Negative Samples: SMOTE

March 14, 2018 · 凡人机器学习

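SMOTE (Synthetic Minority Oversampling Technique) is an improvement on random oversampling. Random oversampling increases the number of minority-class samples by simply duplicating existing ones, which makes the model prone to overfitting: what it learns is too specific and not general enough. The basic idea of SMOTE is to analyze the minority-class samples and synthesize new samples from them, which are then added to the dataset. The idea is illustrated in the figure below, and the algorithm proceeds as follows.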

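(1) For each sample x in the minority class, compute its Euclidean distance to all samples in the minority-class set and obtain its k nearest neighbors.
(2) Set a sampling ratio according to the class imbalance to determine the sampling rate N. For each minority-class sample x, randomly select several samples from its k nearest neighbors; call a selected neighbor xn.
(3) For each randomly selected neighbor xn, construct a new sample from xn and the original sample x according to the following formula:

x_new = x + rand(0, 1) * (xn - x)

where rand(0, 1) is a random number between 0 and 1.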
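To make step (3) concrete, the following minimal sketch evaluates the interpolation x + gap * (xn - x) for a few fixed values of gap. The helper name `interpolate` and the sample values are assumptions chosen for illustration; in SMOTE itself the gap is drawn at random between 0 and 1, so every synthetic sample lies on the line segment between x and xn.

```python
import numpy as np


def interpolate(x, xn, gap):
    """Return x + gap * (xn - x): a point on the segment from x to xn."""
    return x + gap * (xn - x)


x = np.array([2.0, 3.0, 4.0])    # a minority-class sample
xn = np.array([4.0, 5.0, 6.0])   # one of its nearest neighbors

print(interpolate(x, xn, 0.0))   # [2. 3. 4.]  gap = 0 reproduces x
print(interpolate(x, xn, 0.5))   # [3. 4. 5.]  gap = 0.5 gives the midpoint
print(interpolate(x, xn, 1.0))   # [4. 5. 6.]  gap = 1 reproduces xn
```

Because the new point never leaves the segment between two existing minority samples, it adds variation without copying any sample exactly, provided the gap is not 0 and the neighbor differs from x.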

The pseudocode of the SMOTE algorithm is as follows:

A Python implementation is as follows:

import random

import numpy as np
from sklearn.neighbors import NearestNeighbors


class Smote:
    def __init__(self, samples, N=100, k=5):
        # samples: minority-class samples, shape (n_samples, n_attrs)
        # N: oversampling rate in percent; values below 100 yield no synthetic samples here
        # k: number of nearest neighbors to consider
        self.n_samples, self.n_attrs = samples.shape
        self.N = N
        self.k = k
        self.samples = samples
        self.newindex = 0

    def over_sampling(self):
        N = self.N // 100  # synthetic samples generated per original sample
        self.newindex = 0
        self.synthetic = np.zeros((self.n_samples * N, self.n_attrs))
        # Request k + 1 neighbors: each sample is normally returned as its own nearest neighbor.
        neighbors = NearestNeighbors(n_neighbors=self.k + 1).fit(self.samples)
        print('neighbors', neighbors)
        for i in range(len(self.samples)):
            nnarray = neighbors.kneighbors(self.samples[i].reshape(1, -1), return_distance=False)[0]
            nnarray = nnarray[nnarray != i][:self.k]  # drop the sample itself
            self._populate(N, i, nnarray)
        return self.synthetic

    # For each minority-class sample, choose N of its k nearest neighbors and generate N synthetic samples.
    def _populate(self, N, i, nnarray):
        for j in range(N):
            nn = random.randint(0, len(nnarray) - 1)
            dif = self.samples[nnarray[nn]] - self.samples[i]
            gap = random.random()
            self.synthetic[self.newindex] = self.samples[i] + gap * dif
            self.newindex += 1


a = np.array([[1, 2, 3], [4, 5, 6], [2, 3, 1], [2, 1, 2], [2, 3, 4], [2, 3, 4]])
s = Smote(a, N=100)
print(s.over_sampling())
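Step (2) leaves open how the sampling rate N is derived from the class imbalance. One possible convention, assumed here purely for illustration and not taken from the original post, is to pick the largest multiple of 100 that does not push the minority class past the majority class. The helper name `sampling_rate_percent` is hypothetical.

```python
def sampling_rate_percent(n_majority, n_minority):
    """Hypothetical helper: oversampling rate N in percent.

    N // 100 synthetic samples are generated per minority sample, so the
    minority class grows to n_minority * (1 + N // 100) samples.
    """
    if n_minority <= 0:
        raise ValueError("n_minority must be positive")
    return max(n_majority // n_minority - 1, 0) * 100


n_majority, n_minority = 500, 100
N = sampling_rate_percent(n_majority, n_minority)
print(N)                                  # 400
print(n_minority * (1 + N // 100))        # 500, matching the majority class
```

A value computed this way can be passed as the `N` argument of an implementation that, like the one above, interprets N as a percentage. When the class counts are not exact multiples of each other, the minority class ends up slightly smaller than the majority class under this convention.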

I am about to board a flight, but I liked this article and wanted to share it with everyone right away!

Reposted from: http://blog.csdn.net/Yaphat/article/details/52463304?locationNum=7
