Using SMOTEBoost and RUSBoost to deal with class imbalance

from:https://aitopics.org/doc/news:1B9F7A99/

Binary classification with strong class imbalance can be found in many real-world classification problems. From trying to predict events such as network intrusion and bank fraud to a patient's medical diagnosis, the goal in these cases is to be able to identify instances of the minority class -- that is, the class that is underrepresented in the dataset. This, of course, presents a big challenge as most predictive models tend to ignore the more critical minority class while deceptively giving high accuracy results by favoring the majority class. Several techniques have been used to get around the problem of class imbalance, including different sampling methods and modeling algorithms. Examples of sampling methods include adding data samples to the minority class by either duplicating the data or generating synthetic minority samples (oversampling), or randomly removing majority class data to produce a more balanced data distribution (undersampling).

kaggle上建议：摘自：https://www.kaggle.com/general/7793

One approach that I have used in the past was to use a clustering algorithm for the majority class, the negative cases in your example, to create a smaller, representative set of negative cases with which to replace the actual negative cases in the training data.

More specifically, I used K-means algorithm to cluster the negative cases into the x clusters where x is the number of positive examples in my training data. Then I used the cluster centroids as the negative cases and the actual positive cases as the positives. This gave me a 50 / 50 balanced data set for the training. (Note that I only did this for the training data. My cross validation and test sets I kept unclustered).

There are downsides to this approach:

By clustering you lose some accuracy in the negative cases, but that is the price you pay with this approach.
Also, K-means algorithm can converge on a different set of centroids each time you run it as it starts at random starting positions. In optimising the model, I regenerated the negative centroids several times to get different data for the training.

其实rusboost还是很先进的，见下文：

CUSBoost：基于聚类的提升下采样的非平衡数据分类

原论文地址：CUSBoost: Cluster-based Under-sampling with Boosting for Imbalanced Classification

Abstract

普通的机器学习方法，对于非平衡数据分类，总是倾向于最大化占比多的类别的分类准确率，而把占比少的类别分类错误，但是，现实应用中，我们研究的问题，对于少数的类别却更加感兴趣。最近，处理非平衡数据分类问题的方法有：采样方法，成本敏感的学习方法，以及集成学习的方法。这篇文章中，提出了一种新的基于聚类的欠采样boosting方法，CUSBoost，它能够有效地处理非平衡数据分类问题。RUSBoost(random under-sampling with AdaBoost) 和SMOTEBoost (synthetic minority over-sampling with AdaBoost) 算法，在我们提出的算法中作为可选项。经过实验，我们发现CUSBoost算法在处理非平衡数据上能够达到state-of-art的表现，其表现优于一般的集成学习方法。

Introduction

处理非平衡类别问题的方法一般被分为两类：外部（对非平衡数据进行处理得到平衡的数据）、内部（通过降低非平衡类别数据的灵敏度来改变已有的学习算法）的方法。

而CUSBoost的处理方法是：首先把数据分开为少数类别实例和多数类别实例，然后使用K-means算法对多数类别实例进行聚类处理，并且从每个聚类中选择部分数据来组成平衡的数据。聚类的方法帮助我们在多数类别数据中选择了差异性更大的数据（同一个聚类里面的数据则选择的相对较少），比起那些随机采样的方法（随机丢弃多数类别的数据）。CUSBoost combines the sampling and boosting methods to form an efficient and effective algorithm for class imbalance learning.

Related work

Sun等人提出的处理非平衡的二分类问题的方法，首先将大多数类别数据随机分组，每个组内的数据数量和少数类别的数据数量相近。然后由多数类每个组的数据加少数类数据进而组成平衡的数据样本。

Chawla 等人提出一种过采样方法SMOTE，对少数类别数据进行over-sample，不同之处在于，采样的数据是由少数类别数据合成而来。它通过操作特征空间而不是数据空间来合成少数类别的数据（使用KNN）。

Seiffert等人给出的结合Adaboost的随机欠采样方法RUSBoost，RUS减少多数类别的数据组成平衡数据（结合Adaboost）。

CUSBoost Algorithm

CUSBoost是聚类采样和Adaboost方法的结合：
聚类采样：把多数类数据和少数类数据分开，在多数类数据中，使用K-means算法将其分为K

个聚类（K

采用超参数优化决定）。然后在每个聚类中，使用随机地选择50%的数据（这里可以视具体问题进行调整）。使用选择出来的数据和少数类数据一起组成新的平衡数据。

算法伪代码：

解释一下：步骤三是每一次都实行一次欠采样（对事先已经利用K-means方法得到的K个聚类），组成平衡的数据。

My views

本文提出的算法关键点在于基于聚类的欠采样方法（K-means），然后就是结合Adaboost算法来得到最后的模型。文章后面也给出了实验测试结果，有兴趣的可以看一下，可以得到一下几个结论：

总体相比RUSBoost 和SMOTEBoost来说，该方法分类性能具有显著优越的效果；

正是基于聚类的采样方法，导致了如果数据的特征空间非常适合聚类的时候，该方法将表现地较好。否则，可能需要尝试其他方法了；

该算法结合了Adaboost，其实用其他算法模型来代替也不是不可以的，比如当前很火的Xgboost。

Using SMOTEBoost(过采样) and RUSBoost（使用聚类+集成学习） to deal with class imbalance的更多相关文章

OpenCV计算机视觉学习（12）——图像量化处理&图像采样处理（K-Means聚类量化，局部马赛克处理）
如果需要处理的原图及代码,请移步小编的GitHub地址传送门:请点击我如果点击有误:https://github.com/LeBron-Jian/ComputerVisionPractice 准备 ...
Self-paced Clustering Ensemble自步聚类集成论文笔记
Self-paced Clustering Ensemble自步聚类集成论文笔记 2019-06-23 22:20:40 zpainter 阅读数 174 收藏更多分类专栏: 论文版权声明 ...
多视图子空间聚类/表示学习(Multi-view Subspace Clustering/Representation Learning)
多视图子空间聚类/表示学习(Multi-view Subspace Clustering/Representation Learning) 作者:凯鲁嘎吉 - 博客园 http://www.cnblo ...
聚类算法学习-kmeans，kmedoids，GMM
GMM参考这篇文章:Link 简单地说,k-means 的结果是每个数据点被 assign 到其中某一个 cluster 了,而 GMM 则给出这些数据点被 assign 到每个 cluster 的概 ...
K-Means聚类算法原理
K-Means算法是无监督的聚类算法,它实现起来比较简单,聚类效果也不错,因此应用很广泛.K-Means算法有大量的变体,本文就从最传统的K-Means算法讲起,在其基础上讲述K-Means的优化变体 ...
ML: 聚类算法-概论
聚类分析是一种重要的人类行为,早在孩提时代,一个人就通过不断改进下意识中的聚类模式来学会如何区分猫狗.动物植物.目前在许多领域都得到了广泛的研究和成功的应用,如用于模式识别.数据分析.图像处理.市场研 ...
03-01 K-Means聚类算法
目录 K-Means聚类算法一.K-Means聚类算法学习目标二.K-Means聚类算法详解 2.1 K-Means聚类算法原理 2.2 K-Means聚类算法和KNN 三.传统的K-Means聚 ...
机器学习（十）—聚类算法（KNN、Kmeans、密度聚类、层次聚类）
聚类算法任务:将数据集中的样本划分成若干个通常不相交的子集,对特征空间的一种划分. 性能度量:类内相似度高,类间相似度低.两大类:1.有参考标签,外部指标:2.无参照,内部指标. 距离计算:非负性, ...
K-Means 聚类算法
K-Means 概念定义: K-Means 是一种基于距离的排他的聚类划分方法. 上面的 K-Means 描述中包含了几个概念: 聚类(Clustering):K-Means 是一种聚类分析(Clus ...

随机推荐

CentOS 6.4 yum安装LAMP环境
一.制作连外网的yum源文件 1. centOS安装完成时是默认存在的,不需要做任何操作,可以直接使用yum 命令进行操作, 默认是在 /etc/yum.repos.d/目录下的 2. 如果你因为制 ...
XML5632 : Only one root element is allowed. Line: 1, Column 1
奇葩啊, 最后查出来是因为有一个svg文件名对不上...
js浅度克隆/深度克隆
首先弄明白几个概念: 一. 具体数据类型分为两种: 1.原始数据类型 2.引用数据类型原始数据类型存储的是对象的实际地址,包括: number.string.boolean.还有两个特殊的nul ...
微信小程序的官方文档
虽然不知道微信小程序今后的发展情况,不过做为一名it人员的我还是去了解它. 这是他的文档路径,里面有详细的使用和申请内测号的全部流程,这里就不再过多解释了. 看后那个开发小程序的文档记得分析你感觉微信 ...
idangerous swiper
最近使用Swipe.js,发现中文的资料很少,试着翻译了一下.能力有限,翻译难免错漏,欢迎指出,多谢! 翻译自:http://www.idangero.us/sliders/swiper/api.ph ...
Pell方程(求形如x*x-d*y*y=1的通解。)
佩尔方程x*x-d*y*y=1,当d不为完全平方数时,有无数个解,并且知道一个解可以推其他解. 如果d为完全平方数时,可知佩尔方程无解. 假设(x0,y0)是最小正整数解. 则: xn=xn-1*x0 ...
SVN流程图协作图
c++的检测的确比C++更严格
见下面代码 #include <stdio.h> #include <stdlib.h> #include <time.h> enum guess { paper, ...
ArcGIS Scalebar 比例尺
说明.这篇博文的示例代码地图充满body arcgis api for javascript iis怎么离线部署请参考我前面的博文 1.运行效果 3.HTML代码 <!DOCTYPE htm ...
九度OJ 1181：遍历链表（链表、排序）
时间限制:1 秒内存限制:32 兆特殊判题:否提交:2733 解决:1181 题目描述: 建立一个升序链表并遍历输出. 输入: 输入的每个案例中第一行包括1个整数:n(1<=n<=1 ...

Using SMOTEBoost(过采样) and RUSBoost（使用聚类+集成学习） to deal with class imbalance