When working with big data in R (say, using Spark and sparklyr) we have found it very convenient to keep data handles in a neat list or data.frame.

Please read on for our handy hints on keeping your data handles neat.

When using R to work over a big data system (such as Spark), much of your work is with "data handles" rather than actual data (data handles are objects that control access to remote data).

Data handles are a lot like sockets or file handles in that they cannot be safely serialized and restored (i.e., you cannot save them into a .RDS file and then restore them into another session). This means that when you are starting or re-starting a project you must "ready" all of your data references. Your projects will be much easier to manage and document if you load your references using the methods we show below.
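A minimal sketch of the failure mode (illustrative only, assuming d is a sparklyr handle such as a tbl_spark; this example is ours, not from the original post):

# d <- spark_read_parquet(sc, name = "d", path = "d")  # a handle, not data
# saveRDS(d, "d.RDS")     # saves only the small reference object
# # ... later, in a new R session ...
# d2 <- readRDS("d.RDS")  # deserializes, but the embedded connection is stale
# head(d2)                # fails: the remote Spark session is gone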

Let’s set up our example Spark cluster:

library("sparklyr")
# packageVersion('sparklyr')
suppressPackageStartupMessages(library("dplyr"))
# packageVersion('dplyr')
suppressPackageStartupMessages(library("tidyr"))

# Please see the following video for installation help:
#   https://youtu.be/qnINvPqcRvE
# spark_install(version = "2.0.2") # set up a local "practice" Spark instance

sc <- spark_connect(master = "local",
                    version = "2.0.2")
# print(sc)

Data is much easier to manage than code, and much easier to compute over. So the more information you can keep as pure data the better off you will be. In this case we are loading the chosen names and paths of parquet data we wish to work with from an external file that is easy for the user to edit.

# Read user's specification of files and paths.
userSpecification <- read.csv('tableCollection.csv',
                              header = TRUE,
                              strip.white = TRUE,
                              stringsAsFactors = FALSE)
print(userSpecification)
##   tableName tablePath
## 1   data_01   data_01
## 2   data_02   data_02
## 3   data_03   data_03
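(This write-up assumes parquet files already exist at the listed paths. If you want a fully self-contained run, a hypothetical set-up sketch such as the following could create them; the table shapes are inferred from the printed results later in this post, and the copy_to()/spark_write_parquet() steps are our assumption, not part of the original workflow.)

# Hypothetical set-up: write three small example tables as parquet files
# so the paths in tableCollection.csv resolve.
for (i in 1:3) {
  tableName <- paste0("data_0", i)
  d <- as.data.frame(matrix(runif(10 * i), nrow = 10))
  colnames(d) <- paste0(letters[seq_len(i)], "_0", i)
  tmp <- copy_to(sc, d, name = paste0(tableName, "_tmp"), overwrite = TRUE)
  spark_write_parquet(tmp, path = tableName)
}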

We can now read these parquet files (usually stored in Hadoop) into our Spark environment as follows.

readParquets <- function(userSpecification) {
  userSpecification <- as_data_frame(userSpecification)
  userSpecification$handle <- lapply(
    seq_len(nrow(userSpecification)),
    function(i) {
      spark_read_parquet(sc,
                         name = userSpecification$tableName[[i]],
                         path = userSpecification$tablePath[[i]])
    }
  )
  userSpecification
}

tableCollection <- readParquets(userSpecification)
print(tableCollection)
## # A tibble: 3 x 3
##   tableName tablePath          handle
##       <chr>     <chr>          <list>
## 1   data_01   data_01 <S3: tbl_spark>
## 2   data_02   data_02 <S3: tbl_spark>
## 3   data_03   data_03 <S3: tbl_spark>

A data.frame is a great place to keep what you know about your Spark handles in one place. Let’s add some details to our Spark handles.

addDetails <- function(tableCollection) {
  tableCollection <- as_data_frame(tableCollection)
  # get the references
  tableCollection$handle <-
    lapply(tableCollection$tableName,
           function(tableNamei) {
             dplyr::tbl(sc, tableNamei)
           })
  # and tie table names to handles for convenience
  # and printing
  names(tableCollection$handle) <-
    tableCollection$tableName
  # add in some details (note: nrow can be expensive)
  tableCollection$nrow <- vapply(tableCollection$handle,
                                 nrow,
                                 numeric(1))
  tableCollection$ncol <- vapply(tableCollection$handle,
                                 ncol,
                                 numeric(1))
  tableCollection
}

tableCollection <- addDetails(userSpecification)

# convenient printing
print(tableCollection)
## # A tibble: 3 x 5
##   tableName tablePath          handle  nrow  ncol
##       <chr>     <chr>          <list> <dbl> <dbl>
## 1   data_01   data_01 <S3: tbl_spark>    10     1
## 2   data_02   data_02 <S3: tbl_spark>    10     2
## 3   data_03   data_03 <S3: tbl_spark>    10     3
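An aside: on newer dplyr/sparklyr combinations, nrow() on a remote table can return NA rather than a count. If you run into that, sparklyr::sdf_nrow() is a direct row-count for Spark tables; a hedged variant of the detail-gathering step might look like this (our suggestion, not part of the original code):

# Alternative row count for remote tables (assumes a recent sparklyr):
# tableCollection$nrow <- vapply(tableCollection$handle,
#                                sparklyr::sdf_nrow,
#                                numeric(1))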
# look at the top of each table (also forces
# evaluation!).
lapply(tableCollection$handle,
       head)
## $data_01
## Source:   query [6 x 1]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 1
##        a_01
##       <dbl>
## 1 0.8274947
## 2 0.2876151
## 3 0.6638404
## 4 0.1918336
## 5 0.9111187
## 6 0.8802026
##
## $data_02
## Source:   query [6 x 2]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 2
##        a_02       b_02
##       <dbl>      <dbl>
## 1 0.3937457 0.34936496
## 2 0.0195079 0.74376380
## 3 0.9760512 0.00261368
## 4 0.4388773 0.70325800
## 5 0.9747534 0.40327283
## 6 0.6054003 0.53224218
##
## $data_03
## Source:   query [6 x 3]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 3
##         a_03      b_03        c_03
##        <dbl>     <dbl>       <dbl>
## 1 0.59512263 0.2615939 0.592753768
## 2 0.72292799 0.7287428 0.003926143
## 3 0.51846687 0.3641869 0.874463146
## 4 0.01174093 0.9648346 0.177722575
## 5 0.86250126 0.3891915 0.857614579
## 6 0.33082723 0.2633013 0.233822140

A particularly slick trick is to expand the columns column into a taller table that allows us to quickly identify which columns are in which tables.

columnDictionary <- function(tableCollection) {
  tableCollection$columns <-
    lapply(tableCollection$handle,
           colnames)
  columnMap <- tableCollection %>%
    select(tableName, columns) %>%
    unnest(columns)
  columnMap
}

columnMap <- columnDictionary(tableCollection)
print(columnMap)
## # A tibble: 6 x 2
##   tableName columns
##       <chr>   <chr>
## 1   data_01    a_01
## 2   data_02    a_02
## 3   data_02    b_02
## 4   data_03    a_03
## 5   data_03    b_03
## 6   data_03    c_03
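The column map makes lookups across many tables a one-liner. For example, a small usage sketch using standard dplyr (the pattern "^a_" is just an illustration):

# which tables supply a column whose name starts with "a_"?
columnMap %>%
  filter(grepl("^a_", columns))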

The idea is: place all of the above functions into a shared script or package, and then use them to organize loading your Spark data references. With this practice you will have much less "spaghetti code", will better document your intent, and will have a more versatile workflow.

The principles we are using include:

  • Keep configuration out of code (i.e., maintain the file list in a spreadsheet; a short sketch of writing such a control file from R follows this list). This makes working with others much easier.
  • Treat configuration as data (i.e., make sure the configuration is a nice regular table so that you can use R tools such as tidyr::unnest() to work with it).
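For instance, the control file itself can be created and maintained from R (a minimal sketch; the file and column names match the example above):

# Write the user specification as a plain CSV control file.
write.csv(
  data.frame(tableName = paste0("data_0", 1:3),
             tablePath = paste0("data_0", 1:3),
             stringsAsFactors = FALSE),
  'tableCollection.csv',
  row.names = FALSE,
  quote = FALSE)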

Reposted from: http://www.win-vector.com/blog/2017/05/managing-spark-data-handles-in-r/
