Managing Spark data handles in R
When working with big data in R (say, using Spark and sparklyr) we have found it very convenient to keep data handles in a neat list or data_frame.
Please read on for our handy hints on keeping your data handles neat.
When using R to work over a big data system (such as Spark), much of your work is over "data handles" and not actual data (data handles are objects that control access to remote data). Data handles are a lot like sockets or file handles in that they cannot be safely serialized and restored (i.e., you cannot save them into a .RDS file and then restore them into another session). This means when you are starting or re-starting a project you must "ready" all of your data references. Your projects will be much easier to manage and document if you load your references using the methods we show below.
Let’s set up our example Spark cluster:
library("sparklyr")
#packageVersion('sparklyr')
suppressPackageStartupMessages(library("dplyr"))
#packageVersion('dplyr')
suppressPackageStartupMessages(library("tidyr"))
# Please see the following video for installation help
# https://youtu.be/qnINvPqcRvE
# spark_install(version = "2.0.2")
# set up a local "practice" Spark instance
sc <- spark_connect(master = "local",
version = "2.0.2")
#print(sc)
Data is much easier to manage than code, and much easier to compute over. So the more information you can keep as pure data the better off you will be. In this case we are loading the chosen names and paths of parquet data we wish to work with from an external file that is easy for the user to edit.
# Read user's specification of files and paths.
userSpecification <- read.csv('tableCollection.csv',
header = TRUE,
strip.white = TRUE,
stringsAsFactors = FALSE)
print(userSpecification)
## tableName tablePath
## 1 data_01 data_01
## 2 data_02 data_02
## 3 data_03 data_03
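For reference, the tableCollection.csv driving this example would contain rows like the following (illustrative contents, reconstructed to match the printed specification):

```
tableName,tablePath
data_01,data_01
data_02,data_02
data_03,data_03
```

Keeping this list in a plain CSV (or spreadsheet) means non-programmers can add or remove tables without touching any R code.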
We can now read these parquet files (usually stored in Hadoop) into our Spark environment as follows.
readParquets <- function(userSpecification) {
userSpecification <- as_data_frame(userSpecification)
userSpecification$handle <- lapply(
seq_len(nrow(userSpecification)),
function(i) {
spark_read_parquet(sc,
name = userSpecification$tableName[[i]],
path = userSpecification$tablePath[[i]])
}
)
userSpecification
}
tableCollection <- readParquets(userSpecification)
print(tableCollection)
## # A tibble: 3 x 3
## tableName tablePath handle
## <chr> <chr> <list>
## 1 data_01 data_01 <S3: tbl_spark>
## 2 data_02 data_02 <S3: tbl_spark>
## 3 data_03 data_03 <S3: tbl_spark>
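Once the handles are in the table, each one behaves like an ordinary dplyr data source. The sketch below shows the shape of such a query; a local tibble stands in for tableCollection$handle[[1]] so the example runs without a cluster (on a real tbl_spark the same verbs are translated to Spark SQL, and you would add collect() to bring the small summary back to R):

```r
suppressPackageStartupMessages(library("dplyr"))

# Local stand-in for a remote handle; on Spark you would write
# d1 <- tableCollection$handle[[1]] and finish the pipe with collect().
d1 <- tibble::tibble(a_01 = c(0.1, 0.5, 0.9))

res <- d1 %>%
  summarize(mean_a_01 = mean(a_01))

print(res$mean_a_01)
```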
A data.frame is a great place to keep what you know about your Spark handles in one place. Let’s add some details to our Spark handles.
addDetails <- function(tableCollection) {
tableCollection <- as_data_frame(tableCollection)
# get the references
tableCollection$handle <-
lapply(tableCollection$tableName,
function(tableNamei) {
dplyr::tbl(sc, tableNamei)
})
  # add tableNames to handles for convenience
  # and printing
names(tableCollection$handle) <-
tableCollection$tableName
# add in some details (note: nrow can be expensive)
tableCollection$nrow <- vapply(tableCollection$handle,
nrow,
numeric(1))
tableCollection$ncol <- vapply(tableCollection$handle,
ncol,
numeric(1))
tableCollection
}
tableCollection <- addDetails(userSpecification)
# convenient printing
print(tableCollection)
## # A tibble: 3 x 5
## tableName tablePath handle nrow ncol
## <chr> <chr> <list> <dbl> <dbl>
## 1 data_01 data_01 <S3: tbl_spark> 10 1
## 2 data_02 data_02 <S3: tbl_spark> 10 2
## 3 data_03 data_03 <S3: tbl_spark> 10 3
# look at the top of each table (also forces
# evaluation!).
lapply(tableCollection$handle,
head)
## $data_01
## Source: query [6 x 1]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 1
## a_01
## <dbl>
## 1 0.8274947
## 2 0.2876151
## 3 0.6638404
## 4 0.1918336
## 5 0.9111187
## 6 0.8802026
##
## $data_02
## Source: query [6 x 2]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 2
## a_02 b_02
## <dbl> <dbl>
## 1 0.3937457 0.34936496
## 2 0.0195079 0.74376380
## 3 0.9760512 0.00261368
## 4 0.4388773 0.70325800
## 5 0.9747534 0.40327283
## 6 0.6054003 0.53224218
##
## $data_03
## Source: query [6 x 3]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 3
## a_03 b_03 c_03
## <dbl> <dbl> <dbl>
## 1 0.59512263 0.2615939 0.592753768
## 2 0.72292799 0.7287428 0.003926143
## 3 0.51846687 0.3641869 0.874463146
## 4 0.01174093 0.9648346 0.177722575
## 5 0.86250126 0.3891915 0.857614579
## 6 0.33082723 0.2633013 0.233822140
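The recorded nrow column has a practical use beyond printing: it lets us decide which tables are safe to materialize locally. In this sketch a mock collection of local tibbles stands in for the tbl_spark handles so it runs without a cluster; with Spark, dplyr::collect() would pull each table's rows across the network, which is why we filter on size first:

```r
suppressPackageStartupMessages(library("dplyr"))

# Mock stand-in for tableCollection (local tibbles instead of
# tbl_spark handles) so the sketch is runnable without Spark.
mockCollection <- tibble::tibble(
  tableName = c("data_01", "data_02"),
  handle    = list(tibble::tibble(a = 1:10),
                   tibble::tibble(a = 1:5000)),
  nrow      = c(10, 5000)
)

# collect() only the tables known to be small.
smallOnes <- mockCollection %>%
  filter(nrow < 1000)
localCopies <- lapply(smallOnes$handle, collect)
names(localCopies) <- smallOnes$tableName

print(names(localCopies))
```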
A particularly slick trick is to expand the per-table lists of column names into a taller table that lets us quickly identify which columns are in which tables.
columnDictionary <- function(tableCollection) {
tableCollection$columns <-
lapply(tableCollection$handle,
colnames)
columnMap <- tableCollection %>%
select(tableName, columns) %>%
unnest(columns)
columnMap
}
columnMap <- columnDictionary(tableCollection)
print(columnMap)
## # A tibble: 6 x 2
## tableName columns
## <chr> <chr>
## 1 data_01 a_01
## 2 data_02 a_02
## 3 data_02 b_02
## 4 data_03 a_03
## 5 data_03 b_03
## 6 data_03 c_03
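This tall columnMap is immediately useful for lookups, e.g. "which tables carry a given column?" The example below rebuilds a columnMap matching the one printed above (so it runs standalone) and filters it:

```r
suppressPackageStartupMessages(library("dplyr"))

# Rebuild a columnMap like the one printed above so this
# example is self-contained.
columnMap <- tibble::tibble(
  tableName = c("data_01", "data_02", "data_02",
                "data_03", "data_03", "data_03"),
  columns   = c("a_01", "a_02", "b_02", "a_03", "b_03", "c_03")
)

# Which tables carry a column named "a_03"?
tablesWithColumn <- columnMap %>%
  filter(columns == "a_03") %>%
  pull(tableName)

print(tablesWithColumn)
```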
The idea is: place all of the above functions into a shared script or package, and then use them to organize loading your Spark data references. With this practice you will have much less "spaghetti code", better documented intent, and a more versatile workflow.
The principles we are using include:
- Keep configuration out of code (i.e., maintain the file list in a spreadsheet). This makes working with others much easier.
- Treat configuration as data (i.e., make sure the configuration is a nice regular table so that you can use R tools such as tidyr::unnest() to work with it).
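Putting the principles together, every project (re)start can be reduced to one short, repeatable script. This is a sketch only: "sparkHandles.R" is a hypothetical file holding the functions defined above, and it assumes a local Spark install is available.

```r
# Run on every start or re-start of the project to "ready" the
# data references (handles cannot be saved and restored).
library("sparklyr")
suppressPackageStartupMessages(library("dplyr"))
source("sparkHandles.R")  # readParquets(), addDetails(), columnDictionary()

sc <- spark_connect(master = "local", version = "2.0.2")

userSpecification <- read.csv("tableCollection.csv",
                              header = TRUE,
                              strip.white = TRUE,
                              stringsAsFactors = FALSE)
readParquets(userSpecification)                # register the parquet tables
tableCollection <- addDetails(userSpecification)
```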
Reposted from: http://www.win-vector.com/blog/2017/05/managing-spark-data-handles-in-r/
声明类属性或方法为静态,就可以不实例化类而直接访问.静态属性不能通过一个类已实例化的对象来访问(但静态方法可以). PHP 5 支持抽象类和抽象方法.定义为抽象的类不能被实例化 使用接口(interf ...