Accessing data in Hadoop using dplyr and SQL
If your primary objective is to query your data in Hadoop to browse, manipulate, and extract it into R, then you probably want to use SQL. You can write SQL code explicitly to interact with Hadoop, or you can write SQL code implicitly with dplyr. The dplyrpackage has a generalized backend for data sources that translates your R code into SQL. You can use RStudio and dplyr to work with several of the most popular software packages in the Hadoop ecosystem, including Hive, Impala, HBase and Spark.
There are two methods for accessing data in Hadoop using dplyr and SQL.
ODBC
You can connect R and RStudio to Hadoop with an ODBC connection. This effectively treats Hadoop like any other data source (i.e., as if Hadoop were a relational database). You will need a data source specific driver (e.g., Hive, Impala, HBase) installed on your desktop or your sever. You will also need a few R packages. We recommend using these R packages: DBI, dplyr, and odbc. Note that the dplyr package may also reference the dbplyr package to help translate R into specific variants of SQL. You can use the odbc package to create a connection with Hadoop and run queries:
library(odbc)
con <- dbConnect(odbc::odbc(),
driver = <driver>,
host = <host>,
dbname = <dbname>,
user = <user>,
password = <password>,
port = 10000)
tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL
dbDisconnect(con)
Spark
If you are running Spark on Hadoop, you may also elect to use the sparklyr package to access your data in HDFS. Spark is a general engine for large-scale data processing, and it supports SQL. The sparklyr package communicates with the Spark API to run SQL queries, and it also has a dplyr backend. You can use sparklyr to create a connect with Spark run queries:
library(sparklyr)
con <- spark_connect(master = "yarn-client")
tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL
spark_disconnect(con)
转自:https://support.rstudio.com/hc/en-us/articles/115008241668-Accessing-data-in-Hadoop-using-dplyr-and-SQL
Accessing data in Hadoop using dplyr and SQL的更多相关文章
- 【Big Data】HADOOP集群的配置(二)
Hadoop集群的配置(二) 摘要: hadoop集群配置系列文档,是笔者在实验室真机环境实验后整理而得.以便随后工作所需,做以知识整理,另则与博客园朋友分享实验成果,因为笔者在学习初期,也遇到不少问 ...
- java.lang.IllegalStateException:Couldn't read row 0, col -1 from CursorWindow. Make sure the Cursor is initialized correctly before accessing data from it.
java.lang.RuntimeException: Unable to start activity ComponentInfo{com.xxx...}: java.lang.IllegalSta ...
- java.lang.IllegalStateException: Couldn't read row 1, col 0 from CursorWindow. Make sure the Cursor is initialized correctly before accessing data fr
Android中操作Sqlite遇到的错误:java.lang.IllegalStateException: Couldn't read row 1, col 0 from CursorWindow. ...
- android 出现Make sure the Cursor is initialized correctly before accessing data from it
Make sure the Cursor is initialized correctly before accessing data from it 详细错误是:java.lang.IllegalS ...
- [Big Data]从Hadoop到Spark的架构实践
摘要:本文则主要介绍TalkingData在大数据平台建设过程中,逐渐引入Spark,并且以Hadoop YARN和Spark为基础来构建移动大数据平台的过程. 当下,Spark已经在国内得到了广泛的 ...
- 【Big Data】HADOOP集群的配置(一)
Hadoop集群的配置(一) 摘要: hadoop集群配置系列文档,是笔者在实验室真机环境实验后整理而得.以便随后工作所需,做以知识整理,另则与博客园朋友分享实验成果,因为笔者在学习初期,也遇到不少问 ...
- 使用Red Gate Sql Data Compare 数据库同步工具进行SQL Server的两个数据库的数据比较、同步
Sql Data Compare 是比较两个数据库的数据是否相同.生成同步sql的工具. 这一款工具由Red Gate公司出品,我们熟悉的.NET Reflector就是这个公司推出的,它的SQLTo ...
- 举例说明:Hadoop vs. NoSql vs. Sql vs. NewSql
转自:http://blog.jobbole.com/86269/ 尽管层次数据库如今在大型机上依然被广泛使用,但关系数据库(RDBMS)(SQL)已经占领了数据库市场,并且表现的相当优异.我们存 ...
- 数据库原理及应用-SQL数据操纵语言(Data Manipulation Language)和嵌入式SQL&存储过程
2018-02-19 18:03:54 一.数据操纵语言(Data Manipulation Language) 数据操纵语言是指插入,删除和更新语言. 二.视图(View) 数据库三级模式,两级映射 ...
随机推荐
- bzoj 4443: [Scoi2015]小凸玩矩阵
Time Limit: 10 Sec Memory Limit: 128 MBSubmit: 149 Solved: 81[Submit][Status][Discuss] Description ...
- 如何生成ssh密钥对
答:执行以下命令即可,生成的密钥对在~/.ssh下,会生成两个文件,一个id_rsa和id_rsa.pub,前者是私钥,后者是公钥 ssh-keygen -t rsa -C "your_em ...
- 【图片下载-代码】java下载网络图片资源例子
/** * @Description 下载网络图片资源 * @param imageUrl 图片地址 * @return String 下载后的地址 * @author SUNBIN * @date ...
- python 去除字符串两端的引号
a='"srting"' print(a) b=eval(a) print(b) 输出 "srting" srting
- 何时使用MQ ?
何时使用MQmq作为一种基础中间件在互联网项目中有着大量的使用. 一种技术的产生自然是为了解决某种需求,通常来说是以下场景: 需要跨进程通信:B系统需要A系统的输出作为输入参数.当A系统的输出能力远远 ...
- MVVM特点、源(数据)与目标(如:控件等)的映射
数据(源,viewMode表示)与控件(目标)的完全映射, 在绑定之后,通过操作数据,改变控件显示效果.显示内容.状态等.
- Android手机无线adb
1.首先电脑,手机通过数据线链接电脑,然后通过adb devices 查看到已连接 2.输入:adb tcpip 5555 3.输入:adb connect 222.222.221.137:5555 ...
- 十款效果惊艳的Html案例(一)
http://www.html5tricks.com/10-html5-image-effect.html
- java读取PHP接口数据的实现方法(四)
PHP文件: ? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 3 ...
- Uedit个人专注
Uedit个人专注 Windows Registry Editor Version 5.00 [HKEY_CLASSES_ROOT\*\Shell\Uedit] [HKEY_CLASSES_ROO ...