Spark features: extracting, transforming and selecting features
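The snippets below operate on a DataFrame named data holding the Affairs dataset (columns affairs, gender, age, yearsmarried, children, religiousness, education, occupation and rating, as seen in the schema printed further down). A minimal sketch of loading it is shown here; the file path, appName and CSV options are assumptions, not from the original post:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("FeatureTransformDemo").getOrCreate()
import spark.implicits._

// Hypothetical source file; any source that provides these columns as numeric types will do
val data: DataFrame = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/affairs.csv")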
VectorAssembler: assembling columns into a feature vector
import org.apache.spark.ml.feature.VectorAssembler

// Assemble the selected columns into a single feature vector
val colArray = Array("age", "yearsmarried", "religiousness", "education", "occupation", "rating")
val assembler = new VectorAssembler().setInputCols(colArray).setOutputCol("features")
val vecDF: DataFrame = assembler.transform(data)
// vecDF: org.apache.spark.sql.DataFrame = [affairs: double, gender: string ... 8 more fields]

vecDF.select("features", colArray: _*).show(10, truncate = false)
+----------------------------+----+------------+-------------+---------+----------+------+
|features |age |yearsmarried|religiousness|education|occupation|rating|
+----------------------------+----+------------+-------------+---------+----------+------+
|[37.0,10.0,3.0,18.0,7.0,4.0]|37.0|10.0 |3.0 |18.0 |7.0 |4.0 |
|[27.0,4.0,4.0,14.0,6.0,4.0] |27.0|4.0 |4.0 |14.0 |6.0 |4.0 |
|[32.0,15.0,1.0,12.0,1.0,4.0]|32.0|15.0 |1.0 |12.0 |1.0 |4.0 |
|[57.0,15.0,5.0,18.0,6.0,5.0]|57.0|15.0 |5.0 |18.0 |6.0 |5.0 |
|[22.0,0.75,2.0,17.0,6.0,3.0]|22.0|0.75 |2.0 |17.0 |6.0 |3.0 |
|[32.0,1.5,2.0,17.0,5.0,5.0] |32.0|1.5 |2.0 |17.0 |5.0 |5.0 |
|[22.0,0.75,2.0,12.0,1.0,3.0]|22.0|0.75 |2.0 |12.0 |1.0 |3.0 |
|[57.0,15.0,2.0,14.0,4.0,4.0]|57.0|15.0 |2.0 |14.0 |4.0 |4.0 |
|[32.0,15.0,4.0,16.0,1.0,2.0]|32.0|15.0 |4.0 |16.0 |1.0 |2.0 |
|[22.0,1.5,4.0,14.0,4.0,5.0] |22.0|1.5 |4.0 |14.0 |4.0 |5.0 |
+----------------------------+----+------------+-------------+---------+----------+------+
only showing top 10 rows
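Note that VectorAssembler only accepts numeric, boolean and vector input columns, and with its default settings it throws on null values. If the source data may contain nulls, one option (an assumption added here, not shown in the original post) is to drop such rows before assembling:

// Drop rows that have a null in any of the columns being assembled
val cleaned = data.na.drop(colArray)
val vecDFClean = assembler.transform(cleaned)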
VectorIndexer: automatically identifying categorical features and indexing them
import org.apache.spark.ml.feature.VectorIndexer

// Automatically identify categorical features and index them.
// Features with more than 7 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(7)
  .fit(vecDF)

val categoricalFeatures: Set[Int] = featureIndexer.categoryMaps.keys.toSet
// categoricalFeatures: Set[Int] = Set(2, 3, 4, 5)

println(s"Chose ${categoricalFeatures.size} categorical features: " + categoricalFeatures.mkString(", "))
// Chose 4 categorical features: 2, 3, 4, 5

// With maxCategories = 7, 4 of the 6 columns are identified as categorical feature columns.
// Their vector positions are (2, 3, 4, 5), i.e. elements (2, 3, 4, 5) of colArray:
// "religiousness", "education", "occupation", "rating".
// Why these 4? See "counting the distinct values per column" at http://www.cnblogs.com/wwxbi/p/6125363.html:
// each of these 4 columns has no more than 7 distinct values.

// Create a new column "indexedFeatures" with the categorical values transformed to indices
val indexedData = featureIndexer.transform(vecDF)
// indexedData: org.apache.spark.sql.DataFrame = [affairs: double, gender: string ... 9 more fields]

val resColArray = Array("indexedFeatures", "features", "age", "yearsmarried", "religiousness", "education", "occupation", "rating")
indexedData.selectExpr(resColArray: _*).show(10, truncate = false)
+---------------------------+----------------------------+----+------------+-------------+---------+----------+------+
|indexedFeatures |features |age |yearsmarried|religiousness|education|occupation|rating|
+---------------------------+----------------------------+----+------------+-------------+---------+----------+------+
|[37.0,10.0,2.0,5.0,6.0,3.0]|[37.0,10.0,3.0,18.0,7.0,4.0]|37.0|10.0 |3.0 |18.0 |7.0 |4.0 |
|[27.0,4.0,3.0,2.0,5.0,3.0] |[27.0,4.0,4.0,14.0,6.0,4.0] |27.0|4.0 |4.0 |14.0 |6.0 |4.0 |
|[32.0,15.0,0.0,1.0,0.0,3.0]|[32.0,15.0,1.0,12.0,1.0,4.0]|32.0|15.0 |1.0 |12.0 |1.0 |4.0 |
|[57.0,15.0,4.0,5.0,5.0,4.0]|[57.0,15.0,5.0,18.0,6.0,5.0]|57.0|15.0 |5.0 |18.0 |6.0 |5.0 |
|[22.0,0.75,1.0,4.0,5.0,2.0]|[22.0,0.75,2.0,17.0,6.0,3.0]|22.0|0.75 |2.0 |17.0 |6.0 |3.0 |
|[32.0,1.5,1.0,4.0,4.0,4.0] |[32.0,1.5,2.0,17.0,5.0,5.0] |32.0|1.5 |2.0 |17.0 |5.0 |5.0 |
|[22.0,0.75,1.0,1.0,0.0,2.0]|[22.0,0.75,2.0,12.0,1.0,3.0]|22.0|0.75 |2.0 |12.0 |1.0 |3.0 |
|[57.0,15.0,1.0,2.0,3.0,3.0]|[57.0,15.0,2.0,14.0,4.0,4.0]|57.0|15.0 |2.0 |14.0 |4.0 |4.0 |
|[32.0,15.0,3.0,3.0,0.0,1.0]|[32.0,15.0,4.0,16.0,1.0,2.0]|32.0|15.0 |4.0 |16.0 |1.0 |2.0 |
|[22.0,1.5,3.0,2.0,3.0,4.0] |[22.0,1.5,4.0,14.0,4.0,5.0] |22.0|1.5 |4.0 |14.0 |4.0 |5.0 |
+---------------------------+----------------------------+----+------------+-------------+---------+----------+------+
only showing top 10 rows

import org.apache.spark.ml.feature.VectorSlicer

val slicer = new VectorSlicer().setInputCol("indexedFeatures").setOutputCol("slicerFeatures")
slicer.setIndices(Array(3)) // position 3 corresponds to the column "education" before indexing

val output = slicer.transform(indexedData)
output.select("indexedFeatures", "slicerFeatures", "education").limit(10).orderBy($"education").show(10, truncate = false)
+---------------------------+--------------+---------+
|indexedFeatures |slicerFeatures|education|
+---------------------------+--------------+---------+
|[32.0,15.0,0.0,1.0,0.0,3.0]|[1.0] |12.0 |
|[22.0,0.75,1.0,1.0,0.0,2.0]|[1.0] |12.0 |
|[27.0,4.0,3.0,2.0,5.0,3.0] |[2.0] |14.0 |
|[57.0,15.0,1.0,2.0,3.0,3.0]|[2.0] |14.0 |
|[22.0,1.5,3.0,2.0,3.0,4.0] |[2.0] |14.0 |
|[32.0,15.0,3.0,3.0,0.0,1.0]|[3.0] |16.0 |
|[32.0,1.5,1.0,4.0,4.0,4.0] |[4.0] |17.0 |
|[22.0,0.75,1.0,4.0,5.0,2.0]|[4.0] |17.0 |
|[37.0,10.0,2.0,5.0,6.0,3.0]|[5.0] |18.0 |
|[57.0,15.0,4.0,5.0,5.0,4.0]|[5.0] |18.0 |
+---------------------------+--------------+---------+
// As shown above, when a categorical feature column is indexed, the index values follow the sort order of the original values, starting from 0.
// Index values (0, 1, 2, 3, 4, 5, 6) correspond to the original values [9.0, 12.0, 14.0, 16.0, 17.0, 18.0, 20.0] (the distinct values of "education").
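To see this value-to-index mapping directly, inspect featureIndexer.categoryMaps, and count the distinct values per column to check which ones fall within maxCategories. A short sketch (output omitted; that exactly these four columns qualify is what the original post reports):

import org.apache.spark.sql.functions.{col, countDistinct}

// Value -> index map for each categorical feature position, e.g. 3 -> Map(12.0 -> 1, 14.0 -> 2, ...)
featureIndexer.categoryMaps.foreach { case (featureIdx, valueMap) =>
  println(s"feature $featureIdx: " + valueMap.toSeq.sortBy(_._2).mkString(", "))
}

// Columns with at most 7 (maxCategories) distinct values are the ones indexed as categorical
data.select(colArray.map(c => countDistinct(col(c)).alias(c)): _*).show()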
VectorSlicer: slicing a feature vector
import org.apache.spark.ml.feature.{VectorAssembler, VectorSlicer}

// Assemble the columns into a feature vector
val colArray = Array("age", "yearsmarried", "religiousness", "education", "occupation", "rating")
val assembler = new VectorAssembler().setInputCols(colArray).setOutputCol("features")
val vecDF = assembler.transform(data)

val slicer = new VectorSlicer().setInputCol("features").setOutputCol("slicerFeatures")
// Positions within the "features" vector:
// (2, 3, 4) correspond to the columns ("religiousness", "education", "occupation")
slicer.setIndices(Array(2, 3, 4))

val output = slicer.transform(vecDF)
output.select("features", "slicerFeatures", "religiousness", "education", "occupation").show(10, truncate = false)
+----------------------------+--------------+-------------+---------+----------+
|features |slicerFeatures|religiousness|education|occupation|
+----------------------------+--------------+-------------+---------+----------+
|[37.0,10.0,3.0,18.0,7.0,4.0]|[3.0,18.0,7.0]|3.0 |18.0 |7.0 |
|[27.0,4.0,4.0,14.0,6.0,4.0] |[4.0,14.0,6.0]|4.0 |14.0 |6.0 |
|[32.0,15.0,1.0,12.0,1.0,4.0]|[1.0,12.0,1.0]|1.0 |12.0 |1.0 |
|[57.0,15.0,5.0,18.0,6.0,5.0]|[5.0,18.0,6.0]|5.0 |18.0 |6.0 |
|[22.0,0.75,2.0,17.0,6.0,3.0]|[2.0,17.0,6.0]|2.0 |17.0 |6.0 |
|[32.0,1.5,2.0,17.0,5.0,5.0] |[2.0,17.0,5.0]|2.0 |17.0 |5.0 |
|[22.0,0.75,2.0,12.0,1.0,3.0]|[2.0,12.0,1.0]|2.0 |12.0 |1.0 |
|[57.0,15.0,2.0,14.0,4.0,4.0]|[2.0,14.0,4.0]|2.0 |14.0 |4.0 |
|[32.0,15.0,4.0,16.0,1.0,2.0]|[4.0,16.0,1.0]|4.0 |16.0 |1.0 |
|[22.0,1.5,4.0,14.0,4.0,5.0] |[4.0,14.0,4.0]|4.0 |14.0 |4.0 |
+----------------------------+--------------+-------------+---------+----------+
only showing top 10 rows

output.printSchema()
root
|-- affairs: double (nullable = false)
|-- gender: string (nullable = true)
|-- age: double (nullable = false)
|-- yearsmarried: double (nullable = false)
|-- children: string (nullable = true)
|-- religiousness: double (nullable = false)
|-- education: double (nullable = false)
|-- occupation: double (nullable = false)
|-- rating: double (nullable = false)
|-- features: vector (nullable = true)
|-- slicerFeatures: vector (nullable = true)
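Besides positional indices, VectorSlicer can also select entries by name when the vector column carries ML attribute metadata; VectorAssembler records the input column names in that metadata, so a sketch like the following should work on "features" (an assumption added here, output not shown in the original post):

// Slice by attribute name instead of by position; relies on the name metadata attached by VectorAssembler
val namedSlicer = new VectorSlicer()
  .setInputCol("features")
  .setOutputCol("eduOccupation")
  .setNames(Array("education", "occupation"))

namedSlicer.transform(vecDF).select("features", "eduOccupation").show(5, truncate = false)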
Bucketizer: discretizing continuous data into specified ranges
import org.apache.spark.ml.feature.Bucketizer

// Double.NegativeInfinity: negative infinity; Double.PositiveInfinity: positive infinity
// 6 buckets: [-inf, -100), [-100, -10), [-10, 0), [0, 10), [10, 90), [90, +inf)
val splits = Array(Double.NegativeInfinity, -100, -10, 0.0, 10, 90, Double.PositiveInfinity)

val data: Array[Double] = Array(-180, -160, -100, -50, -70, -20, -8, -5, -3, 0.0, 1, 3, 7, 10, 30, 60, 90, 100, 120, 150)
val dataFrame = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
// dataFrame: org.apache.spark.sql.DataFrame = [features: double]

val bucketizer = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(splits)

// Transform the original values into bucket indices
val bucketedData = bucketizer.transform(dataFrame)
bucketedData.show(50, truncate = false)
+--------+----------------+
|features|bucketedFeatures|
+--------+----------------+
|-180.0 |0.0 |
|-160.0 |0.0 |
|-100.0 |1.0 |
|-50.0 |1.0 |
|-70.0 |1.0 |
|-20.0 |1.0 |
|-8.0 |2.0 |
|-5.0 |2.0 |
|-3.0 |2.0 |
|0.0 |3.0 |
|1.0 |3.0 |
|3.0 |3.0 |
|7.0 |3.0 |
|10.0 |4.0 |
|30.0 |4.0 |
|60.0 |4.0 |
|90.0 |5.0 |
|100.0 |5.0 |
|120.0 |5.0 |
|150.0 |5.0 |
+--------+----------------+
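Each bucket is left-closed and right-open, [lower, upper), and because both infinities are included in splits every value falls into some bucket. The helper below is illustrative only (not Bucketizer's actual implementation) and mirrors that lookup for a single value:

// Map a value to its bucket index given ascending split points; bucket i covers [splits(i), splits(i+1))
def bucketIndex(value: Double, splits: Array[Double]): Int = {
  require(value >= splits.head && value <= splits.last, s"$value lies outside the configured splits")
  val firstGreater = splits.indexWhere(_ > value)
  if (firstGreater == -1) splits.length - 2 else firstGreater - 1 // the last bucket also includes its upper bound
}

// e.g. bucketIndex(-20.0, splits) == 1 and bucketIndex(90.0, splits) == 5, matching the table above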