R formulas in Spark and un-nesting data in SparklyR: Nice and handy!
Intro
In an earlier post I talked about Spark and sparklyR and did some experiments. At my work here at RTL Nederland we have a Spark cluster on Amazon EMR to do some serious heavy lifting on click and video-on-demand data. For an R user it makes perfect sense to use Spark through the sparklyR interface. However, using Spark through the pySpark interface certainly has its benefits: it exposes much more of the Spark functionality, and I find the concept of ML Pipelines in Spark very elegant.
Below I'd like to share two little tricks I use when working with Spark.
The RFormula feature selector
As an R user you have to get used to working with Spark through pySpark; moreover, I had to brush up my rusty Python knowledge. For training machine learning models there is some help, though, in the form of an RFormula.
R users know the concept of model formulae: they can be a handy way to specify predictive models concisely. In Spark you can use the same concept; only a limited set of the R operators is available (+, -, . and :), but it is enough to be useful. The snippet below shows a simple example.
from pyspark.ml.feature import RFormula

# model formula: Targetf is the label, paidDuration and Gender are the features
f1 = "Targetf ~ paidDuration + Gender"
formula = RFormula(formula=f1)

# fit the formula on the training data and add the 'features' and 'label' columns
train2 = formula.fit(train).transform(train)
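Beyond the single + used above, the other supported operators behave just as they do in R formulas. A couple of hypothetical formula strings for illustration (the userId column and the interaction are made up; the same strings can be passed to RFormula from pySpark or used in R):

# ':' adds an interaction term, '.' means all columns except the target, '-' removes a term
f2 = "Targetf ~ paidDuration + Gender + paidDuration:Gender"
f3 = "Targetf ~ . - userId"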
A handy thing about an RFormula in Spark is that, just like a formula used in R's lm and some other modelling functions, string features are automatically one-hot encoded, so they can be fed directly into Spark's machine learning algorithms.
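For readers who want to see the R-side behaviour this comparison refers to, here is a small base-R sketch (plain R, not Spark; the toy data is made up): model.matrix expands a factor column in a formula into indicator columns, which is essentially what Spark's RFormula does for string columns.

# toy data with a categorical feature
d <- data.frame(
  Targetf      = c(1, 0, 1),
  paidDuration = c(10, 25, 40),
  Gender       = factor(c("M", "F", "F"))
)

# the factor is expanded into dummy columns: (Intercept), paidDuration, GenderM
model.matrix(Targetf ~ paidDuration + Gender, data = d)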
Nested (hierarchical) data in sparklyR
Sometimes you may find yourself with nested, hierarchical data. In pySpark you can flatten such a hierarchy if needed. A simple example: suppose you read in a parquet file whose columns contain nested structs; you can then flatten the data by selecting the nested fields with dot notation (for example select("parent.child")).

In sparklyR, however, reading the same parquet file results in something that isn't directly workable at first sight. If you open the table viewer to inspect the data, the nested columns simply show up as rows containing <environment>. Fortunately, the facilities sparklyR uses internally to call Spark are also available to the end user, and you can invoke additional Spark methods yourself if needed. So we can invoke the select and col methods ourselves to flatten the hierarchy, as sketched below. After registering the output object, it is visible in the Spark interface and you can view its content.
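A minimal sketch of that invoke-based flattening in sparklyR (the connection settings, parquet path and nested column names are made up for illustration, and the exact invoke calls may need tweaking for your Spark version):

library(sparklyr)

sc  <- spark_connect(master = "yarn-client")                      # hypothetical cluster connection
tmp <- spark_read_parquet(sc, "tmp", "s3://bucket/nested_data")   # hypothetical parquet path

# get a handle on the underlying Spark DataFrame behind the sparklyR table
sdf <- spark_dataframe(tmp)

# build Column objects for the nested fields by invoking Spark's col() method
cols <- list(
  invoke(sdf, "col", "session.id"),
  invoke(sdf, "col", "session.duration")
)

# invoke select() with those columns to flatten the hierarchy
flat_sdf <- invoke(sdf, "select", cols)

# register the result; it now shows up in the Spark interface
flat_tbl <- sdf_register(flat_sdf, "flat_data")

From there flat_tbl behaves like any other sparklyR table, so the usual dplyr verbs work on the flattened columns.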
Thanks for reading my two tricks. Cheers, Longhow.
Reposted from: https://longhowlam.wordpress.com/2017/02/15/r-formulas-in-spark-and-un-nesting-data-in-sparklyr-nice-and-handy/