Accessing data in Hadoop using dplyr and SQL
If your primary objective is to query your data in Hadoop to browse, manipulate, and extract it into R, then you probably want to use SQL. You can write SQL code explicitly to interact with Hadoop, or you can write SQL code implicitly with dplyr
. The dplyr
package has a generalized backend for data sources that translates your R code into SQL. You can use RStudio and dplyr
to work with several of the most popular software packages in the Hadoop ecosystem, including Hive, Impala, HBase and Spark.
There are two methods for accessing data in Hadoop using dplyr
and SQL.
ODBC
You can connect R and RStudio to Hadoop with an ODBC connection. This effectively treats Hadoop like any other data source (i.e., as if Hadoop were a relational database). You will need a data source specific driver (e.g., Hive, Impala, HBase) installed on your desktop or your sever. You will also need a few R packages. We recommend using these R packages: DBI
, dplyr
, and odbc
. Note that the dplyr
package may also reference the dbplyr
package to help translate R into specific variants of SQL. You can use the odbc
package to create a connection with Hadoop and run queries:
library(odbc)
con <- dbConnect(odbc::odbc(),
driver = <driver>,
host = <host>,
dbname = <dbname>,
user = <user>,
password = <password>,
port = 10000)
tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL
dbDisconnect(con)
Spark
If you are running Spark on Hadoop, you may also elect to use the sparklyr
package to access your data in HDFS. Spark is a general engine for large-scale data processing, and it supports SQL. The sparklyr
package communicates with the Spark API to run SQL queries, and it also has a dplyr
backend. You can use sparklyr
to create a connect with Spark run queries:
library(sparklyr)
con <- spark_connect(master = "yarn-client")
tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL
spark_disconnect(con)
转自:https://support.rstudio.com/hc/en-us/articles/115008241668-Accessing-data-in-Hadoop-using-dplyr-and-SQL
Accessing data in Hadoop using dplyr and SQL的更多相关文章
- 【Big Data】HADOOP集群的配置(二)
Hadoop集群的配置(二) 摘要: hadoop集群配置系列文档,是笔者在实验室真机环境实验后整理而得.以便随后工作所需,做以知识整理,另则与博客园朋友分享实验成果,因为笔者在学习初期,也遇到不少问 ...
- java.lang.IllegalStateException:Couldn't read row 0, col -1 from CursorWindow. Make sure the Cursor is initialized correctly before accessing data from it.
java.lang.RuntimeException: Unable to start activity ComponentInfo{com.xxx...}: java.lang.IllegalSta ...
- java.lang.IllegalStateException: Couldn't read row 1, col 0 from CursorWindow. Make sure the Cursor is initialized correctly before accessing data fr
Android中操作Sqlite遇到的错误:java.lang.IllegalStateException: Couldn't read row 1, col 0 from CursorWindow. ...
- android 出现Make sure the Cursor is initialized correctly before accessing data from it
Make sure the Cursor is initialized correctly before accessing data from it 详细错误是:java.lang.IllegalS ...
- [Big Data]从Hadoop到Spark的架构实践
摘要:本文则主要介绍TalkingData在大数据平台建设过程中,逐渐引入Spark,并且以Hadoop YARN和Spark为基础来构建移动大数据平台的过程. 当下,Spark已经在国内得到了广泛的 ...
- 【Big Data】HADOOP集群的配置(一)
Hadoop集群的配置(一) 摘要: hadoop集群配置系列文档,是笔者在实验室真机环境实验后整理而得.以便随后工作所需,做以知识整理,另则与博客园朋友分享实验成果,因为笔者在学习初期,也遇到不少问 ...
- 使用Red Gate Sql Data Compare 数据库同步工具进行SQL Server的两个数据库的数据比较、同步
Sql Data Compare 是比较两个数据库的数据是否相同.生成同步sql的工具. 这一款工具由Red Gate公司出品,我们熟悉的.NET Reflector就是这个公司推出的,它的SQLTo ...
- 举例说明:Hadoop vs. NoSql vs. Sql vs. NewSql
转自:http://blog.jobbole.com/86269/ 尽管层次数据库如今在大型机上依然被广泛使用,但关系数据库(RDBMS)(SQL)已经占领了数据库市场,并且表现的相当优异.我们存 ...
- 数据库原理及应用-SQL数据操纵语言(Data Manipulation Language)和嵌入式SQL&存储过程
2018-02-19 18:03:54 一.数据操纵语言(Data Manipulation Language) 数据操纵语言是指插入,删除和更新语言. 二.视图(View) 数据库三级模式,两级映射 ...
随机推荐
- 20145201李子璇 《网络对抗》 Web基础
1.实验后回答问题 (1)什么是表单 它在网页中主要负责数据采集功能,通过用户提交的一些数据来建立网站管理者与浏览者之间的桥梁. 两个组成部分:①HTML源代码用于描述表单(比如域,标签和浏览者在页面 ...
- impress.js初体验——前端装X利器
impress.js 是国外一位开发者受 Prezi 启发,采用 CSS3 与 JavaScript 语言完成的一个可供开发者使用的表现层框架(演示工具).其功能包括画布的无限旋转与缩放,任意角度放置 ...
- Educational Codeforces Round 47 D
Let's call an undirected graph $G=(V,E)$ relatively prime if and only if for each edge $(v,u)∈E$ $GC ...
- Redis之持久化
Redis 持久化 提供了多种不同级别的持久化方式:一种是RDB,另一种是AOF. RDB方式的持久化是通过快照(snapshotting)完成的,当符合一定条件时Redis会自动将内存中的所有数据进 ...
- AIM Tech Round 5 (rated, Div. 1 + Div. 2)
A. Find Square 找到对角线的两个点的坐标,这道题就迎刃而解了. inline void work(int n) { int m; cin >> m; memset(str, ...
- springboot2 统一返回结果
统一返回结果是说,不用在controller层,new一个对象,或用工厂创建这个对象,再返回这个对象,而是这个Action该返回什么就返回什么, 我们再通过mvc的流程,再对返回对象做进一步的封装,以 ...
- 97. Interleaving String *HARD* -- 判断s3是否为s1和s2交叉得到的字符串
Given s1, s2, s3, find whether s3 is formed by the interleaving of s1 and s2. For example,Given:s1 = ...
- 时间序列挖掘-预测算法-三次指数平滑法(Holt-Winters)——三次指数平滑算法可以很好的保存时间序列数据的趋势和季节性信息
from:http://www.cnblogs.com/kemaswill/archive/2013/04/01/2993583.html 在时间序列中,我们需要基于该时间序列当前已有的数据来预测其在 ...
- WEB标准以及W3C的理解和认识
web标准简单来说可以分为结构.表现和行为.其中结构主要是有HTML标签组成.表现即指css样式表,通过css可以是页面的结构标签更具美感.行为是指页面和用户具有一定的交互,同时页面结构或者表现发生变 ...
- gzip压缩初探
前言 我们平时工作中传文件时为了提高传输速度一般都会把文件压缩一下再传,那样速度回快一些,尤其是那些文件很多的文件夹,比较常用的压缩格式就是zip,rar了.那我们日常网页中利用http协议请求的那些 ...