Intro

In an earlier post I talked about Spark and sparklyR and did some experiments. At my work here at RTL Nederland we have a Spark cluster on Amazon EMR to do some serious heavy lifting on click and video-on-demand data. For an R user it makes perfect sense to use Spark through the sparklyR interface. However, using Spark through the pySpark interface certainly has its benefits: it exposes much more of the Spark functionality, and I find the concept of ML Pipelines in Spark very elegant.

While using Spark I came across two little tricks that I would like to share with you below.

The RFormula feature selector

As an R user you have to get used to using Spark through pySpark; moreover, I had to brush up on my rusty Python knowledge. For training machine learning models there is some help, though, in the form of RFormula.

R users know the concept of model formulae in R; it can be a handy way to formulate predictive models concisely. In Spark you can also use this concept. Only a limited set of R operators is available (+, . and :), but it is enough to be useful. The example below shows a simple case.

from pyspark.ml.feature import RFormula

# Specify the model: Targetf is the label, paidDuration and Gender are features
f1 = "Targetf ~ paidDuration + Gender"
formula = RFormula(formula = f1)

# Fit the formula on the training data and transform it; this adds
# the 'features' and 'label' columns that Spark ML algorithms expect
train2 = formula.fit(train).transform(train)

A handy thing about an RFormula in Spark (just like a formula in R's lm and some other modeling functions) is that string features used in the RFormula are automatically one-hot encoded, so that they can be used directly in the Spark machine learning algorithms.
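For instance (a minimal sketch, assuming Targetf is a binary target and train2 is the transformed DataFrame from above), the features and label columns can be fed straight into a Spark ML estimator, with the string column Gender already one-hot encoded inside the features vector:

from pyspark.ml.classification import LogisticRegression

# 'features' and 'label' were created by the RFormula transform above;
# Gender has been one-hot encoded behind the scenes
lr = LogisticRegression(featuresCol = "features", labelCol = "label")
model = lr.fit(train2)
model.transform(train2).select("label", "prediction").show(5)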

Nested (hierarchical) data in sparklyR

Sometimes you may find yourself with nested (hierarchical) data. In pySpark you can flatten this hierarchy if needed. A simple example: suppose you read in a parquet file that has a nested structure. To flatten the data you can select the nested fields explicitly, as in the first sketch below.

In sparklyR, however, reading the same parquet file results in something that isn't useful to work with at first sight. If you open the table viewer to look at the data, you will see rows with <environment>. Fortunately, the facilities used internally by sparklyR to call Spark are available to the end user, so you can invoke more Spark methods if needed. We can invoke the select and col methods ourselves to flatten the hierarchy, as in the second sketch below. After registering the output object, it is visible in the Spark interface and you can view its content.
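A minimal pySpark sketch of the flattening step; the nested struct column person with fields name and age, and the file path, are hypothetical stand-ins for the actual schema:

from pyspark.sql.functions import col

# Read the nested parquet file (path is illustrative)
df = spark.read.parquet("/data/nested_example.parquet")
df.printSchema()

# Pull the nested struct fields up to top-level columns
flat = df.select(
    col("person.name").alias("name"),
    col("person.age").alias("age")
)
flat.show(5)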

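And a sketch of the sparklyR side, with the same hypothetical column names and path. spark_dataframe() hands you the underlying Spark DataFrame object, on which invoke() can call Spark methods such as select directly:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")  # or your cluster master
nested_tbl <- spark_read_parquet(sc, name = "nested",
                                 path = "/data/nested_example.parquet")

# The nested column shows up as <environment> in the table viewer, so we
# drop down to the Spark DataFrame and invoke select ourselves; the
# trailing list() fills select's String... varargs parameter
flat_sdf <- nested_tbl %>%
  spark_dataframe() %>%
  invoke("select", "person.name", list("person.age"))

# Register the result so it is visible as a regular sparklyR table again
flat_tbl <- sdf_register(flat_sdf, "flat_table")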
Thanks for reading my two tricks. Cheers, Longhow.

Source: https://longhowlam.wordpress.com/2017/02/15/r-formulas-in-spark-and-un-nesting-data-in-sparklyr-nice-and-handy/
