VectorAssembler: assembling columns into a feature vector

import org.apache.spark.ml.feature.VectorAssembler

val colArray = Array("age", "yearsmarried", "religiousness", "education", "occupation", "rating")

// Assemble the columns into a feature vector
val assembler = new VectorAssembler().setInputCols(colArray).setOutputCol("features")
val vecDF: DataFrame = assembler.transform(data)
// vecDF: org.apache.spark.sql.DataFrame = [affairs: double, gender: string ... 8 more fields]

vecDF.select("features", colArray: _*).show(10, truncate = false)
+----------------------------+----+------------+-------------+---------+----------+------+
|features |age |yearsmarried|religiousness|education|occupation|rating|
+----------------------------+----+------------+-------------+---------+----------+------+
|[37.0,10.0,3.0,18.0,7.0,4.0]|37.0|10.0 |3.0 |18.0 |7.0 |4.0 |
|[27.0,4.0,4.0,14.0,6.0,4.0] |27.0|4.0 |4.0 |14.0 |6.0 |4.0 |
|[32.0,15.0,1.0,12.0,1.0,4.0]|32.0|15.0 |1.0 |12.0 |1.0 |4.0 |
|[57.0,15.0,5.0,18.0,6.0,5.0]|57.0|15.0 |5.0 |18.0 |6.0 |5.0 |
|[22.0,0.75,2.0,17.0,6.0,3.0]|22.0|0.75 |2.0 |17.0 |6.0 |3.0 |
|[32.0,1.5,2.0,17.0,5.0,5.0] |32.0|1.5 |2.0 |17.0 |5.0 |5.0 |
|[22.0,0.75,2.0,12.0,1.0,3.0]|22.0|0.75 |2.0 |12.0 |1.0 |3.0 |
|[57.0,15.0,2.0,14.0,4.0,4.0]|57.0|15.0 |2.0 |14.0 |4.0 |4.0 |
|[32.0,15.0,4.0,16.0,1.0,2.0]|32.0|15.0 |4.0 |16.0 |1.0 |2.0 |
|[22.0,1.5,4.0,14.0,4.0,5.0] |22.0|1.5 |4.0 |14.0 |4.0 |5.0 |
+----------------------------+----+------------+-------------+---------+----------+------+
only showing top 10 rows
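The `data` DataFrame used above is assumed to have been prepared in an earlier post of this series (the Affairs dataset, with the columns visible in the output: `affairs`, `gender`, `children`, plus the six numeric columns in `colArray`). A minimal sketch of producing such a DataFrame — the file path and CSV options are hypothetical, not from the original post:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("FeatureTransform").getOrCreate()

// Hypothetical path: the Affairs data is assumed to be available as a CSV with a header row.
val data = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/affairs.csv")
```

With schema inference enabled, the six columns in `colArray` come in as numeric types, which is what VectorAssembler requires.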

VectorIndexer: automatically identifying categorical features and indexing them

import org.apache.spark.ml.feature.VectorIndexer

val colArray = Array("age", "yearsmarried", "religiousness", "education", "occupation", "rating") 

// Automatically identify categorical features and index them.
// Features with more than 7 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(7)
  .fit(vecDF)

val categoricalFeatures: Set[Int] = featureIndexer.categoryMaps.keys.toSet
// categoricalFeatures: Set[Int] = Set(2, 3, 4, 5)

println(s"Chose ${categoricalFeatures.size} categorical features: " + categoricalFeatures.mkString(", "))
// Chose 4 categorical features: 2, 3, 4, 5

// So with maxCategories = 7, four of the six columns are identified as categorical feature columns.
// Their vector indices are (2, 3, 4, 5), matching elements (2, 3, 4, 5) of colArray,
// i.e. "religiousness", "education", "occupation" and "rating".
// Why these four? See "counting the distinct values per column" in my post at
// http://www.cnblogs.com/wwxbi/p/6125363.html: each of these four columns has at most 7 distinct values.

// Create new column "indexedFeatures" with categorical values transformed to indices
val indexedData = featureIndexer.transform(vecDF)
// indexedData: org.apache.spark.sql.DataFrame = [affairs: double, gender: string ... 9 more fields]

val resColArray = Array("indexedFeatures", "features", "age", "yearsmarried", "religiousness", "education", "occupation", "rating")
indexedData.selectExpr(resColArray: _*).show(10, truncate = false)
+---------------------------+----------------------------+----+------------+-------------+---------+----------+------+
|indexedFeatures |features |age |yearsmarried|religiousness|education|occupation|rating|
+---------------------------+----------------------------+----+------------+-------------+---------+----------+------+
|[37.0,10.0,2.0,5.0,6.0,3.0]|[37.0,10.0,3.0,18.0,7.0,4.0]|37.0|10.0 |3.0 |18.0 |7.0 |4.0 |
|[27.0,4.0,3.0,2.0,5.0,3.0] |[27.0,4.0,4.0,14.0,6.0,4.0] |27.0|4.0 |4.0 |14.0 |6.0 |4.0 |
|[32.0,15.0,0.0,1.0,0.0,3.0]|[32.0,15.0,1.0,12.0,1.0,4.0]|32.0|15.0 |1.0 |12.0 |1.0 |4.0 |
|[57.0,15.0,4.0,5.0,5.0,4.0]|[57.0,15.0,5.0,18.0,6.0,5.0]|57.0|15.0 |5.0 |18.0 |6.0 |5.0 |
|[22.0,0.75,1.0,4.0,5.0,2.0]|[22.0,0.75,2.0,17.0,6.0,3.0]|22.0|0.75 |2.0 |17.0 |6.0 |3.0 |
|[32.0,1.5,1.0,4.0,4.0,4.0] |[32.0,1.5,2.0,17.0,5.0,5.0] |32.0|1.5 |2.0 |17.0 |5.0 |5.0 |
|[22.0,0.75,1.0,1.0,0.0,2.0]|[22.0,0.75,2.0,12.0,1.0,3.0]|22.0|0.75 |2.0 |12.0 |1.0 |3.0 |
|[57.0,15.0,1.0,2.0,3.0,3.0]|[57.0,15.0,2.0,14.0,4.0,4.0]|57.0|15.0 |2.0 |14.0 |4.0 |4.0 |
|[32.0,15.0,3.0,3.0,0.0,1.0]|[32.0,15.0,4.0,16.0,1.0,2.0]|32.0|15.0 |4.0 |16.0 |1.0 |2.0 |
|[22.0,1.5,3.0,2.0,3.0,4.0] |[22.0,1.5,4.0,14.0,4.0,5.0] |22.0|1.5 |4.0 |14.0 |4.0 |5.0 |
+---------------------------+----------------------------+----+------------+-------------+---------+----------+------+
only showing top 10 rows

import org.apache.spark.ml.feature.VectorSlicer

val slicer = new VectorSlicer().setInputCol("indexedFeatures").setOutputCol("slicerFeatures")
slicer.setIndices(Array(3)) // index 3 here refers to the column "education" before indexing

val output = slicer.transform(indexedData)
output.select("indexedFeatures", "slicerFeatures", "education").limit(10).orderBy($"education").show(10, truncate = false)
+---------------------------+--------------+---------+
|indexedFeatures |slicerFeatures|education|
+---------------------------+--------------+---------+
|[32.0,15.0,0.0,1.0,0.0,3.0]|[1.0] |12.0 |
|[22.0,0.75,1.0,1.0,0.0,2.0]|[1.0] |12.0 |
|[27.0,4.0,3.0,2.0,5.0,3.0] |[2.0] |14.0 |
|[57.0,15.0,1.0,2.0,3.0,3.0]|[2.0] |14.0 |
|[22.0,1.5,3.0,2.0,3.0,4.0] |[2.0] |14.0 |
|[32.0,15.0,3.0,3.0,0.0,1.0]|[3.0] |16.0 |
|[32.0,1.5,1.0,4.0,4.0,4.0] |[4.0] |17.0 |
|[22.0,0.75,1.0,4.0,5.0,2.0]|[4.0] |17.0 |
|[37.0,10.0,2.0,5.0,6.0,3.0]|[5.0] |18.0 |
|[57.0,15.0,4.0,5.0,5.0,4.0]|[5.0] |18.0 |
+---------------------------+--------------+---------+
// This shows that once a categorical feature column has been indexed, the indices follow
// the sort order of the original values, starting from 0:
// index numbers (0, 1, 2, 3, 4, 5, 6) correspond to the values [9.0, 12.0, 14.0, 16.0, 17.0, 18.0, 20.0]
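That value-to-index mapping does not have to be inferred from the output; it can be read directly from the fitted model's `categoryMaps`, which maps each categorical vector index to a map from original value to category index. A sketch (the printed values depend on the data):

```scala
// categoryMaps: Map[vector index -> Map[original value -> category index]]
featureIndexer.categoryMaps.foreach { case (featureIdx, valueMap) =>
  // Sort by category index so the printout shows the 0, 1, 2, ... ordering
  val sorted = valueMap.toSeq.sortBy(_._2)
  println(s"feature $featureIdx: " + sorted.map { case (v, i) => s"$v -> $i" }.mkString(", "))
}
// For "education" (vector index 3) this should print something like:
// feature 3: 9.0 -> 0, 12.0 -> 1, 14.0 -> 2, 16.0 -> 3, 17.0 -> 4, 18.0 -> 5, 20.0 -> 6
```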

VectorSlicer: slicing a feature vector

import org.apache.spark.ml.feature.{VectorAssembler, VectorSlicer}

val colArray = Array("age", "yearsmarried", "religiousness", "education", "occupation", "rating") 

// Assemble the columns into a feature vector
val assembler = new VectorAssembler().setInputCols(colArray).setOutputCol("features")
val vecDF = assembler.transform(data)

val slicer = new VectorSlicer().setInputCol("features").setOutputCol("slicerFeatures")
// Specify the indices into the vector column "features":
// (2, 3, 4) correspond to the columns ("religiousness", "education", "occupation")
slicer.setIndices(Array(2, 3, 4))

val output = slicer.transform(vecDF)
output.select("features", "slicerFeatures", "religiousness", "education", "occupation").show(10, truncate = false)
+----------------------------+--------------+-------------+---------+----------+
|features |slicerFeatures|religiousness|education|occupation|
+----------------------------+--------------+-------------+---------+----------+
|[37.0,10.0,3.0,18.0,7.0,4.0]|[3.0,18.0,7.0]|3.0 |18.0 |7.0 |
|[27.0,4.0,4.0,14.0,6.0,4.0] |[4.0,14.0,6.0]|4.0 |14.0 |6.0 |
|[32.0,15.0,1.0,12.0,1.0,4.0]|[1.0,12.0,1.0]|1.0 |12.0 |1.0 |
|[57.0,15.0,5.0,18.0,6.0,5.0]|[5.0,18.0,6.0]|5.0 |18.0 |6.0 |
|[22.0,0.75,2.0,17.0,6.0,3.0]|[2.0,17.0,6.0]|2.0 |17.0 |6.0 |
|[32.0,1.5,2.0,17.0,5.0,5.0] |[2.0,17.0,5.0]|2.0 |17.0 |5.0 |
|[22.0,0.75,2.0,12.0,1.0,3.0]|[2.0,12.0,1.0]|2.0 |12.0 |1.0 |
|[57.0,15.0,2.0,14.0,4.0,4.0]|[2.0,14.0,4.0]|2.0 |14.0 |4.0 |
|[32.0,15.0,4.0,16.0,1.0,2.0]|[4.0,16.0,1.0]|4.0 |16.0 |1.0 |
|[22.0,1.5,4.0,14.0,4.0,5.0] |[4.0,14.0,4.0]|4.0 |14.0 |4.0 |
+----------------------------+--------------+-------------+---------+----------+
only showing top 10 rows

output.printSchema()
root
|-- affairs: double (nullable = false)
|-- gender: string (nullable = true)
|-- age: double (nullable = false)
|-- yearsmarried: double (nullable = false)
|-- children: string (nullable = true)
|-- religiousness: double (nullable = false)
|-- education: double (nullable = false)
|-- occupation: double (nullable = false)
|-- rating: double (nullable = false)
|-- features: vector (nullable = true)
|-- slicerFeatures: vector (nullable = true)
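Besides positional indices, VectorSlicer can also select by attribute name via `setNames`, provided the vector column carries ML attribute metadata — which VectorAssembler attaches from the input column names, so it works on `vecDF` here. A sketch (the output column name `namedSlice` is my own choice):

```scala
val namedSlicer = new VectorSlicer()
  .setInputCol("features")
  .setOutputCol("namedSlice")
  .setNames(Array("religiousness", "education", "occupation"))

// Should select the same three components as setIndices(Array(2, 3, 4)) above
namedSlicer.transform(vecDF).select("features", "namedSlice").show(3, truncate = false)
```

Selecting by name is more robust than by position when the set of assembled columns may change.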

Bucketizer: discretizing continuous values into specified ranges

import org.apache.spark.ml.feature.Bucketizer

// Double.NegativeInfinity: negative infinity; Double.PositiveInfinity: positive infinity
// Six buckets: [-inf, -100), [-100, -10), [-10, 0), [0, 10), [10, 90), [90, +inf)
val splits = Array(Double.NegativeInfinity, -100, -10, 0.0, 10, 90, Double.PositiveInfinity)

val data: Array[Double] = Array(-180, -160, -100, -50, -70, -20, -8, -5, -3, 0.0, 1, 3, 7, 10, 30, 60, 90, 100, 120, 150)
val dataFrame = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
// dataFrame: org.apache.spark.sql.DataFrame = [features: double]

val bucketizer = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(splits)

// Map the raw values to bucket indices
val bucketedData = bucketizer.transform(dataFrame)
// bucketedData: org.apache.spark.sql.DataFrame = [features: double, bucketedFeatures: double]

bucketedData.show(50, truncate = false)
+--------+----------------+
|features|bucketedFeatures|
+--------+----------------+
|-180.0 |0.0 |
|-160.0 |0.0 |
|-100.0 |1.0 |
|-50.0 |1.0 |
|-70.0 |1.0 |
|-20.0 |1.0 |
|-8.0 |2.0 |
|-5.0 |2.0 |
|-3.0 |2.0 |
|0.0 |3.0 |
|1.0 |3.0 |
|3.0 |3.0 |
|7.0 |3.0 |
|10.0 |4.0 |
|30.0 |4.0 |
|60.0 |4.0 |
|90.0 |5.0 |
|100.0 |5.0 |
|120.0 |5.0 |
|150.0 |5.0 |
+--------+----------------+
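The assignment rule itself is simple to state: a value x lands in bucket i when splits(i) <= x < splits(i + 1). A plain-Scala sketch of that lookup, needing no Spark at all, using binary search over the split points (the function name `bucketOf` is my own):

```scala
val bucketSplits = Array(Double.NegativeInfinity, -100.0, -10.0, 0.0, 10.0, 90.0, Double.PositiveInfinity)

def bucketOf(x: Double): Int = {
  // binarySearch returns the index when x is a split point,
  // otherwise -(insertionPoint) - 1; both cases reduce to the bucket index below
  val i = java.util.Arrays.binarySearch(bucketSplits, x)
  if (i >= 0) i else -i - 2
}

println(bucketOf(-180.0)) // 0
println(bucketOf(0.0))    // 3  (split points belong to the bucket on their right)
println(bucketOf(150.0))  // 5
```

Note that a value equal to a split point (like 0.0 or 10.0 above) falls into the bucket to its right, matching the half-open intervals in the table.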
