Stratified Sampling in Spark MLlib

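Learning Spark's MLlib component: basic concepts
1. Explanation
The concept of stratified sampling itself is not covered here; this post goes straight to the concrete operations.
RDDs provide operations that sample directly, such as sampleByKey and sample; this post focuses on these two.
(1) Assign strings of length 2 to stratum 2 and strings of length 3 to stratum 1, then sample stratum 1 and stratum 2 with different probabilities.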
Data:

aa
bb
cc
dd
ee
aaa
bbb
ccc
ddd
eee

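For example:
val fractions: Map[Int, Double] = List((1, 0.2), (2, 0.8)).toMap // sampling fraction for each stratum
sampleByKey(withReplacement = false, fractions, 0)
fractions sets a sampling fraction of 0.2 for stratum 1 and 0.8 for stratum 2.
withReplacement = false means sampling without replacement.
0 is the seed for the random number generator.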
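
To make the role of fractions concrete, here is a minimal plain-Scala sketch (no Spark required) of per-key sampling without replacement: each record is kept with the probability assigned to its stratum. The names StratifiedSketch, stratumOf, sampleByStratum and expectedSize are made up for this illustration. It mimics the idea behind sampleByKey, not Spark's actual implementation, so the records it selects will generally differ from Spark's for the same seed.

```scala
import scala.util.Random

object StratifiedSketch {
  // Strings of length 3 form stratum 1; everything else (length 2 here) forms stratum 2.
  def stratumOf(s: String): Int = if (s.length == 3) 1 else 2

  // Keep each (key, value) pair with probability fractions(key).
  // Dropping keys that are missing from the map is this sketch's own choice.
  def sampleByStratum[K, V](data: Seq[(K, V)], fractions: Map[K, Double], seed: Long): Seq[(K, V)] = {
    val rng = new Random(seed)
    data.filter { case (k, _) => rng.nextDouble() < fractions.getOrElse(k, 0.0) }
  }

  // Approximate sample size as described in the sampleByKey doc comment:
  // the sum over all keys of ceil(numItems * samplingRate).
  def expectedSize[K, V](data: Seq[(K, V)], fractions: Map[K, Double]): Int =
    data.groupBy(_._1).map { case (k, items) =>
      math.ceil(items.size * fractions.getOrElse(k, 0.0)).toInt
    }.sum

  def main(args: Array[String]): Unit = {
    val rows = Seq("aa", "bb", "cc", "dd", "ee", "aaa", "bbb", "ccc", "ddd", "eee")
    val data = rows.map(r => (stratumOf(r), r))
    val fractions = Map(1 -> 0.2, 2 -> 0.8)
    println(s"expected size: ${expectedSize(data, fractions)}")
    sampleByStratum(data, fractions, seed = 0L).foreach(println)
  }
}
```

For the ten strings above, the approximate size works out to ceil(5 * 0.2) + ceil(5 * 0.8) = 1 + 4 = 5 records; any single run may return more or fewer.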

Source code:

  /**
   * Return a subset of this RDD sampled by key (via stratified sampling).
   *
   * Create a sample of this RDD using variable sampling rates for different keys as specified by
   * `fractions`, a key to sampling rate map, via simple random sampling with one pass over the
   * RDD, to produce a sample of size that's approximately equal to the sum of
   * math.ceil(numItems * samplingRate) over all key values.
   *
   * @param withReplacement whether to sample with or without replacement
   * @param fractions map of specific keys to sampling rates
   * @param seed seed for the random number generator
   * @return RDD containing the sampled subset
   */
  def sampleByKey(withReplacement: Boolean,
      fractions: Map[K, Double],
      seed: Long = Utils.random.nextLong): RDD[(K, V)] = self.withScope {
    require(fractions.values.forall(v => v >= 0.0), "Negative sampling rates.")
    val samplingFunc = if (withReplacement) {
      StratifiedSamplingUtils.getPoissonSamplingFunction(self, fractions, false, seed)
    } else {
      StratifiedSamplingUtils.getBernoulliSamplingFunction(self, fractions, false, seed)
    }
    self.mapPartitionsWithIndex(samplingFunc, preservesPartitioning = true)
  }
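
The branch on withReplacement picks between two samplers: a Bernoulli sampler, where each record appears at most once, and a Poisson sampler, where a record may appear several times. A rough plain-Scala sketch of that difference follows. SamplerSketch, bernoulli, poisson and poissonDraw are hypothetical names, and the Poisson draw uses Knuth's multiplication method as a stand-in; it is not taken from StratifiedSamplingUtils.

```scala
import scala.util.Random

object SamplerSketch {
  // Without replacement: keep each record at most once, with probability fractions(key).
  def bernoulli[K, V](data: Seq[(K, V)], fractions: Map[K, Double], seed: Long): Seq[(K, V)] = {
    val rng = new Random(seed)
    data.filter { case (k, _) => rng.nextDouble() < fractions(k) }
  }

  // Draw from a Poisson distribution with mean lambda (Knuth's method, fine for small lambda).
  def poissonDraw(lambda: Double, rng: Random): Int = {
    val limit = math.exp(-lambda)
    var count = 0
    var product = rng.nextDouble()
    while (product > limit) {
      count += 1
      product *= rng.nextDouble()
    }
    count
  }

  // With replacement: emit each record Poisson(fractions(key)) times, so duplicates can occur.
  def poisson[K, V](data: Seq[(K, V)], fractions: Map[K, Double], seed: Long): Seq[(K, V)] = {
    val rng = new Random(seed)
    data.flatMap { case (k, v) => Seq.fill(poissonDraw(fractions(k), rng))((k, v)) }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(1 -> "aaa", 1 -> "bbb", 2 -> "aa", 2 -> "bb", 2 -> "cc")
    val fractions = Map(1 -> 0.2, 2 -> 0.8)
    println("Bernoulli: " + bernoulli(data, fractions, 0L).mkString(", "))
    println("Poisson:   " + poisson(data, fractions, 0L).mkString(", "))
  }
}
```

In both cases the fraction is the expected number of times a record is emitted, which is why the same fractions map can be used for either mode.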

2. Code:

import org.apache.spark.{SparkConf, SparkContext}

object StratifiedSamplingLearning {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass.getSimpleName.filter(!_.equals('$')))
    val sc = new SparkContext(conf)

    println("First:")
    val data = sc.textFile("D:\\TestData\\StratifiedSampling.txt") // read the data
      .map(row => {
      // assign each string to a stratum
      if (row.length == 3) // check the string length
        (row, 1) // length 3 -> stratum 1
      else (row, 2) // length 2 -> stratum 2
    }).map(each => (each._2, each._1)) // swap so the stratum becomes the key
    data.foreach(println)

    println("sampleByKey:")
    val fractions: Map[Int, Double] = List((1, 0.2), (2, 0.8)).toMap // sampling fraction for each stratum
    val approxSample = data.sampleByKey(withReplacement = false, fractions, 0) // draw the stratified sample
    approxSample.foreach(println)

    println("Second:")
    // NOTE: the keys and the seed of this part are missing from the source text;
    // 7, 6 and 42 are assumed values, not necessarily the ones originally used.
    val randRDD = sc.parallelize(List((7, "cat"), (6, "mouse"), (7, "cup"), (6, "book"), (7, "tv"), (6, "screen"), (7, "heater")))
    val sampleMap = List((7, 0.4), (6, 0.8)).toMap
    val sample2 = randRDD.sampleByKey(false, sampleMap, 42).collect
    sample2.foreach(println)

    println("Third:")
    // NOTE: the range, the partition count and the seeds are also missing from the source text;
    // 1 to 20, 3 partitions and seed 0 are assumed values.
    val a = sc.parallelize(1 to 20, 3)
    val b = a.sample(true, 0.8, 0) // with replacement
    val c = a.sample(false, 0.8, 0) // without replacement
    println("RDD a : " + a.collect().mkString(" , "))
    println("RDD b : " + b.collect().mkString(" , "))
    println("RDD c : " + c.collect().mkString(" , "))

    sc.stop()
  }
}

3. Results:

First:
(2,aa)
(1,bbb)
(2,bb)
(1,ccc)
(2,cc)
(1,ddd)
(2,dd)
(1,eee)
(2,ee)
(1,aaa)

sampleByKey:
(2,aa)
(2,bb)
(2,cc)
(2,ee)

Second (the keys are not recoverable from the source text; only the sampled values remain):
(?,cat)
(?,mouse)
(?,book)
(?,screen)
(?,heater)

Third (the numeric values are not recoverable from the source text):
RDD a : ...
RDD b : ...
RDD c : ...
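
As a quick sanity check on the sampleByKey output above, the realized sampling rate per stratum can be compared with the requested fractions. The sketch below is plain Scala; RateCheck and realizedRates are made-up names, and the stratum keys are re-derived from string length (length 3 is stratum 1, length 2 is stratum 2) as described in section 1.

```scala
object RateCheck {
  // Same stratum rule as in section 1: length 3 -> stratum 1, otherwise stratum 2.
  def stratumOf(s: String): Int = if (s.length == 3) 1 else 2

  // Fraction of each stratum's records that ended up in the sample.
  def realizedRates(population: Seq[String], sample: Seq[String]): Map[Int, Double] = {
    val sampled = sample.groupBy(stratumOf).map { case (k, v) => k -> v.size }
    population.groupBy(stratumOf).map { case (k, v) =>
      k -> sampled.getOrElse(k, 0).toDouble / v.size
    }
  }

  def main(args: Array[String]): Unit = {
    val population = Seq("aa", "bb", "cc", "dd", "ee", "aaa", "bbb", "ccc", "ddd", "eee")
    val sample = Seq("aa", "bb", "cc", "ee") // values reported in the sampleByKey output
    println(realizedRates(population, sample)) // stratum 2 comes out at 0.8, stratum 1 at 0.0
  }
}
```

Stratum 2 matches its requested fraction of 0.8, while stratum 1 contributed no records even though 0.2 was requested. This is consistent with the doc comment's wording that the sample size is only approximately equal to the target, which is especially visible on data this small.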
