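In practical classification and data-mining work, the classes are often heavily imbalanced. Training directly on such data degrades the accuracy of some models, such as logistic regression and SVM. One remedy is to resample the data so that the classes are balanced. This post provides a simple implementation. Input and output use the format Label + Tab + Features, for example the LIBSVM format (the script splits on any whitespace, so space-separated LIBSVM files work too):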

-1 1:0.875 2:-1 3:-0.333333 4:-0.509434 5:-0.347032 6:-1 7:1 8:-0.236641 9:1 10:-0.935484 11:-1 12:-0.333333 13:-1
+1 1:0.166667 2:1 3:-0.333333 4:-0.433962 5:-0.383562 6:-1 7:-1 8:0.0687023 9:-1 10:-0.903226 11:-1 12:-1 13:1
+1 1:0.708333 2:1 3:1 4:-0.320755 5:-0.105023 6:-1 7:1 8:-0.419847 9:-1 10:-0.225806 12:1 13:-1
-1 1:0.583333 2:-1 3:0.333333 4:-0.603774 5:1 6:-1 7:1 8:0.358779 9:-1 10:-0.483871 12:-1 13:1
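To make the format concrete, here is a minimal sketch of how one such line can be split into a label and a sparse feature map. The helper name `parse_libsvm_line` is hypothetical and is not part of SampleDataset.py; the sampling script itself only reads the first token (the label) of each line. Indices missing from a line, such as 11 in the third sample row, are implicitly zero in this sparse format.

```python
def parse_libsvm_line(line):
    """Split one LIBSVM-format line into (label, {feature_index: value}).

    Hypothetical helper for illustration only.
    """
    parts = line.split()          # splits on tabs or spaces
    label = parts[0]              # kept as a string, e.g. "+1" or "-1"
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features


label, features = parse_libsvm_line("+1 1:0.708333 2:1 3:1 10:-0.225806 12:1 13:-1")
print(label, sorted(features))
```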

Usage:

Usage: {0} [options] dataset subclass_size [output]
options:
-s method : method of selection (default 0)
0 -- over-sampling & under-sampling given subclass_size
1 -- over-sampling (subclass_size: any value)
2 -- under-sampling(subclass_size: any value)

Bash example:

python SampleDataset.py -s 0 heart_scale 20 heart_scale.txt

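Here the -s option selects the sampling method. subclass_size must always be supplied as a positional argument, but only -s 0 actually uses it.

-s 0: Over-sampling & under-sampling. Classes with more than subclass_size samples are down-sampled and classes with fewer are over-sampled, so every class ends up with subclass_size samples.

-s 1: Over-sampling. Minority classes are resampled until every class has as many samples as the largest class.

-s 2: Under-sampling. Majority classes are down-sampled until every class has as many samples as the smallest class.

Input data file: heart_scale

Output data file: heart_scale.txt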
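The three -s modes differ only in the per-class size they aim for. The sketch below computes that target for each mode. The function name `target_sizes` and the class counts are hypothetical and chosen only for illustration; they are not taken from heart_scale or from the script.

```python
def target_sizes(class_counts, method, subclass_size=None):
    """Return the number of samples each class should have after balancing.

    Hypothetical helper: mirrors the three -s modes, not the script's own code.
    """
    if method == 0:
        target = subclass_size                 # user-supplied size
    elif method == 1:
        target = max(class_counts.values())    # grow to the largest class
    elif method == 2:
        target = min(class_counts.values())    # shrink to the smallest class
    else:
        raise ValueError("method must be 0, 1 or 2")
    return {label: target for label in class_counts}


counts = {"+1": 30, "-1": 100}  # hypothetical imbalanced class counts
for method in (0, 1, 2):
    print(method, target_sizes(counts, method, subclass_size=20))
```

With these made-up counts, mode 0 aims for 20 per class, mode 1 for 100, and mode 2 for 30.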

The code file, SampleDataset.py, follows:

#!/usr/bin/env python
"""Sample a class-balanced subset from an imbalanced LIBSVM-format dataset."""
import sys
from collections import defaultdict

from sklearn.utils import check_random_state


def exit_with_help(argv):
    print("""\
Usage: {0} [options] dataset subclass_size [output]

options:
-s method : method of selection (default 0)
     0 -- over-sampling & under-sampling given subclass_size
     1 -- over-sampling (subclass_size: any value)
     2 -- under-sampling(subclass_size: any value)

output : balance set file (optional)
If output is omitted, the subset will be printed on the screen.""".format(argv[0]))
    sys.exit(1)


def process_options(argv):
    argc = len(argv)
    if argc < 3:
        exit_with_help(argv)

    # default method is over-sampling & under-sampling
    method = 0
    BalanceSet_file = sys.stdout

    i = 1
    while i < argc:
        if argv[i][0] != "-":
            break
        if argv[i] == "-s":
            i = i + 1
            if i >= argc:
                exit_with_help(argv)
            method = int(argv[i])
            if method not in [0, 1, 2]:
                print("Unknown selection method {0}".format(method))
                exit_with_help(argv)
        i = i + 1

    # both positional arguments must still follow the options
    if i + 1 >= argc:
        exit_with_help(argv)
    dataset = argv[i]
    BalanceSet_size = int(argv[i + 1])
    if i + 2 < argc:
        BalanceSet_file = open(argv[i + 2], 'w')

    return dataset, BalanceSet_size, method, BalanceSet_file


def stratified_selection(dataset, subset_size, method):
    # group line numbers by label (the first whitespace-separated token)
    label_linenums = defaultdict(list)
    with open(dataset) as f:
        for i, line in enumerate(f):
            if line.strip():  # ignore blank lines
                label_linenums[line.split(None, 1)[0]].append(i)

    ret = []

    # classes with fewer data are sampled first
    label_list = sorted(label_linenums, key=lambda x: len(label_linenums[x]))
    min_class = label_list[0]
    maj_class = label_list[-1]
    min_class_num = len(label_linenums[min_class])
    maj_class_num = len(label_linenums[maj_class])
    # fixed seed, so the same lines are selected on every run
    random_state = check_random_state(42)

    for label in label_list:
        linenums = label_linenums[label]
        label_size = len(linenums)
        if method == 0:
            if label_size < subset_size:
                # small class: keep every line, then draw the shortfall
                ret += linenums
                subnum = subset_size - label_size
            else:
                # large class: draw subset_size lines
                subnum = subset_size
        elif method == 1:
            # keep every line; top up to the majority-class size
            ret += linenums
            if label == maj_class:
                continue
            subnum = maj_class_num - label_size
        elif method == 2:
            if label == min_class:
                ret += linenums
                continue
            subnum = min_class_num
        # indices are drawn with replacement, so a line may be picked more than once
        ret += [linenums[i] for i in random_state.randint(low=0, high=label_size, size=subnum)]

    # shuffle with the same seeded generator so the output order is reproducible too
    random_state.shuffle(ret)
    return ret


def main(argv=sys.argv):
    dataset, subset_size, method, subset_file = process_options(argv)
    selected_lines = stratified_selection(dataset, subset_size, method)

    # select instances based on selected_lines
    with open(dataset, 'r') as f:
        datalist = f.readlines()
    for i in selected_lines:
        # the last line of the file may lack a trailing newline
        subset_file.write(datalist[i].rstrip('\n') + '\n')
    if subset_file is not sys.stdout:
        subset_file.close()


if __name__ == '__main__':
    main(sys.argv)
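After running the script, it is worth checking that the output really is balanced. The following is a small sketch for that check. The names `label_counts` and `is_balanced` are hypothetical helpers, not part of SampleDataset.py, and the sample lines are made up for illustration.

```python
from collections import Counter


def label_counts(lines):
    """Count the leading label token of every non-blank line."""
    return Counter(line.split(None, 1)[0] for line in lines if line.strip())


def is_balanced(counts):
    """True when every class has the same number of samples."""
    return len(set(counts.values())) <= 1


# Made-up lines standing in for the contents of an output file.
sample = ["+1 1:0.1 2:1\n", "-1 1:0.5\n", "\n", "+1 2:-1\n", "-1 1:0.2 3:1\n"]
counts = label_counts(sample)
print(dict(counts), is_balanced(counts))
```

To check a real file, pass an open file object instead of the list, for example `label_counts(open("heart_scale.txt"))`, assuming that file exists in the working directory.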
