Spark(3) - External Data Source
Introduction
Spark provides a unified runtime for big data. HDFS, Hadoop's distributed filesystem, is the most commonly used storage platform for Spark, as it provides cost-effective storage for unstructured and semi-structured data on commodity hardware. Spark is not limited to HDFS, however, and can work with any Hadoop-supported storage.
Hadoop-supported storage means storage that can work with Hadoop's InputFormat and OutputFormat interfaces. InputFormat is responsible for creating InputSplits from the input data and dividing each split further into records; OutputFormat is responsible for writing to storage.
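To see how these interfaces surface in the Spark API, here is a minimal sketch (the file:///tmp/sample.txt path is just a placeholder) that loads the same text file twice: once with the convenience method sc.textFile, and once by naming TextInputFormat and its key/value classes explicitly through sc.hadoopFile:
scala> import org.apache.hadoop.io.{LongWritable, Text}
scala> import org.apache.hadoop.mapred.TextInputFormat
scala> val lines = sc.textFile("file:///tmp/sample.txt") // uses TextInputFormat under the hood
scala> val kv = sc.hadoopFile[LongWritable, Text, TextInputFormat]("file:///tmp/sample.txt") // (byte offset, line) pairs
scala> val linesAgain = kv.map(_._2.toString) // drop the offsets to get back to plain lines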
Loading data from the local filesystem
Though the local filesystem is not a good fit for storing big data, due to disk size limits and the lack of a distributed nature, you can technically load data into a distributed system from the local filesystem. The file or directory you access, however, has to be available on every node.
1. create the words directory
mkdir words
2. get into the words directory
cd words
3. create the sh.txt file
echo "to be or not to be" > sh.txt
4. start the spark shell
spark-shell
5. load the words directory as RDD
scala> val words = sc.textFile("file:///home/hduser/words")
6. count the number of lines
scala> words.count
7. divide the line (or lines) into multiple words
scala> val wordsFlatMap = words.flatMap(_.split("\\W+"))
8. convert word to (word,1)
scala> val wordsMap = wordsFlatMap.map( w => (w,1))
9. add the number of occurrences for each word
scala> val wordCount = wordsMap.reduceByKey((a,b) => a+b)
10. print the RDD
scala> wordCount.collect.foreach(println)
11. doing all in one step
scala> sc.textFile("file:///home/hduser/ words"). flatMap(_.split("\\W+")).map( w => (w,1)). reduceByKey( (a,b) => (a+b)).foreach(println)
Loading data from HDFS
HDFS is the most widely used big data storage system. One of the reasons for its wide adoption is schema-on-read: HDFS puts no restrictions on data when it is being written, so any and all kinds of data can be stored in raw form. This makes it ideal storage for raw unstructured and semi-structured data.
1. create the words directory
mkdir words
2. get into the words directory
cd words
3. create the sh.txt file and upload the words directory to HDFS
echo "to be or not to be" > sh.txt
cd ..
hdfs dfs -put words /user/hduser/words
4. start the spark shell
spark-shell
5. load the words directory as RDD
scala> val words = sc.textFile("hdfs://localhost:9000/user/hduser/words")
6. count the number of lines
scala> words.count
7. divide the line (or lines) into multiple words
scala> val wordsFlatMap = words.flatMap(_.split("\\W+"))
8. convert word to (word,1)
scala> val wordsMap = wordsFlatMap.map( w => (w,1))
9. add the number of occurrences for each word
scala> val wordCount = wordsMap.reduceByKey((a,b) => a+b)
10. print the RDD
scala> wordCount.collect.foreach(println)
11. doing all in one step
scala> sc.textFile("file:///home/hduser/ words"). flatMap(_.split("\\W+")).map( w => (w,1)). reduceByKey( (a,b) => (a+b)).foreach(println)
Loading data from HDFS using a custom InputFormat
Sometimes you need to load data in a specific format for which TextInputFormat is not a good fit. Spark provides two methods for this purpose:
1. sparkContext.hadoopFile: This supports the old MapReduce API
2. sparkContext.newAPIHadoopFile: This supports the new MapReduce API
These two methods provide support for all of Hadoop's built-in InputFormats as well as any custom InputFormat.
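For comparison, here is a rough sketch of the old-API route using sparkContext.hadoopFile; it uses the org.apache.hadoop.mapred version of KeyValueTextInputFormat and assumes the same currency directory that the steps below create:
scala> import org.apache.hadoop.io.Text
scala> import org.apache.hadoop.mapred.KeyValueTextInputFormat
scala> val oldApiFile = sc.hadoopFile[Text, Text, KeyValueTextInputFormat]("hdfs://localhost:9000/user/hduser/currency")
scala> oldApiFile.map(t => (t._1.toString, t._2.toString)).collect.foreach(println)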
1. create the currency directory
mkdir currency
2. get into the currency directory
cd currency
3. create the na.txt file (the country and currency fields must be tab-separated, because KeyValueTextInputFormat splits each line into key and value at the first tab) and upload the currency folder to HDFS
vi na.txt
United States of America US Dollar
Canada Canadian Dollar
Mexico Peso
cd ..
hdfs dfs -put currency /user/hduser/currency
4. start the spark shell and import statements
spark-shell
scala> import org.apache.hadoop.io.Text
scala> import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat
5. load the currency directory as an RDD and convert it from a tuple of (Text, Text) to a tuple of (String, String)
scala> val currencyFile = sc.newAPIHadoopFile("hdfs://localhost:9000/user/hduser/currency", classOf[KeyValueTextInputFormat], classOf[Text], classOf[Text])
scala> val currencyRDD = currencyFile.map(t => (t._1.toString, t._2.toString))
6. count the number of elements in the RDD
scala> currencyRDD.count
7. print the RDD
scala> currencyRDD.collect.foreach(println)
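Because currencyRDD is a pair RDD keyed by country name, you can also look up a single entry; for example (assuming the tab-separated na.txt contents above):
scala> currencyRDD.lookup("United States of America")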
Loading data from Amazon S3
Amazon Simple Storage Service (S3) provides developers and IT teams with a secure, durable, and scalable storage platform. The biggest advantage of Amazon S3 is that there is no up-front IT investment, and companies can build capacity (just by clicking a button) as they need it.
Though Amazon S3 can be used with any compute platform, it integrates really well with Amazon's cloud services such as Amazon Elastic Compute Cloud (EC2) and Amazon Elastic Block Store (EBS). For this reason, companies that use Amazon Web Services (AWS) are likely to have a significant amount of data already stored on Amazon S3.
1. go to http://aws.amazon.com and log in with username and password
2. navigate to Storage & Content Delivery | S3 | Create Bucket
3. enter the bucket name - for example, com.infoobjects.wordcount
4. select Region, click on Create
5. click on Create Folder and enter words as the folder name
6. create sh.txt file on the local system
echo "to be or not to be" > sh.txt
7. navigate to words | Upload | Add Files and choose sh.txt from the dialog box
8. click on Start Upload
9. select sh.txt and click on Properties
10. set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables
11. open the spark shell and load the words directory from S3 into the words RDD
scala> val words = sc.textFile("s3n://com.infoobjects.wordcount/words")
Loading data from Apache Cassandra
Apache Cassandra is a NoSQL database with a masterless ring cluster structure. While HDFS is a good fit for streaming data access, it does not work well with random access. For example, HDFS will work well when your average file size is 100 MB and you want to read the whole file. If you frequently access the nth line in a file or some other part as a record, HDFS would be too slow.
Relational databases have traditionally provided a solution to this, offering low-latency random access, but they do not work well with big data. NoSQL databases such as Cassandra fill the gap by providing relational-database-style access, but in a distributed architecture on commodity servers.
1. create a keyspace named people in Cassandra using the CQL shell
cqlsh> CREATE KEYSPACE people WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
2. create a column family (from CQL 3.0 onwards it can also be called a table) named person
cqlsh> create columnfamily person(id int primary key, first_name varchar, last_name varchar);
3. insert a few records in the column family
cqlsh> insert into person(id,first_name,last_name) values(1,'Barack','Obama');
cqlsh> insert into person(id,first_name,last_name) values(2,'Joe','Smith');
4. add Cassandra connector dependency to SBT
"com.datastax.spark" % "spark-cassandra-connector" % 1.2.0
5. alternatively, add the Cassandra connector dependency to Maven
<dependency>
  <groupId>com.datastax.spark</groupId>
  <artifactId>spark-cassandra-connector_2.10</artifactId>
  <version>1.2.0</version>
</dependency>
6. start the spark shell with the connector JAR on the classpath (the uber JAR built in the last section of this post works for this) and with the spark.cassandra.connection.host property set; the property must be supplied before the SparkContext is created, so pass it on the command line rather than mutating sc.getConf afterwards
spark-shell --jars thirdparty/sc-uber.jar --conf spark.cassandra.connection.host=localhost
7. verify the property on the running context
scala> sc.getConf.get("spark.cassandra.connection.host")
8. import Cassandra-specific libraries
scala> import com.datastax.spark.connector._
9. load the person column family as an RDD
scala> val personRDD = sc.cassandraTable("people", "person")
10. count the number of lines
scala> personRDD.count
11. print the RDD
scala> personRDD.collect.foreach(println)
12. retrieve the first row
scala> val firstRow = personRDD.first
13. get the column names
scala> firstRow.columnNames
14. access Cassandra through Spark SQL
scala> val cc = new org.apache.spark.sql.cassandra.CassandraSQLContext(sc)
15. load the person data as SchemaRDD
scala> val p = cc.sql("select * from people.person")
16. print the person data
scala> p.collect.foreach(println)
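The connector works in the other direction too; as a quick sketch (the third person record is made up), you can write an RDD of tuples back to the same column family with saveToCassandra:
scala> val newPeople = sc.parallelize(Seq((3, "Jane", "Doe"))) // hypothetical record
scala> newPeople.saveToCassandra("people", "person", SomeColumns("id", "first_name", "last_name"))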
Creating uber JARs with the sbt-assembly plugin
1. create the uber directory
mkdir uber
2. get into the uber directory
cd uber
3. open the SBT prompt
sbt
4. give the project a name sc-uber, save the session and exit
> set name := "sc-uber"
> session save
> exit
5. add the spark-cassandra-connector dependency and a merge strategy to build.sbt
vi build.sbt
name := "sc-uber"
libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector" % "1.1.0"
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => (xs map {_.toLowerCase}) match {
case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) => MergeStrategy.discard
case _ => MergeStrategy.discard
}
case _ => MergeStrategy.first
}
6. create plugins.sbt in the project folder
vi plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")
7. build a JAR
sbt assembly
The uber JAR is now created in target/scala-2.10/sc-uber-assembly-0.1-SNAPSHOT.jar.
8. move the JAR to a thirdparty directory and give it a friendlier name
mkdir -p thirdparty
mv target/scala-2.10/sc-uber-assembly-0.1-SNAPSHOT.jar thirdparty/sc-uber.jar
9. load the spark shell with the uber JAR
spark-shell --jars thirdparty/sc-uber.jar
10. call spark-submit with the --jars option to submit Scala code to a cluster
spark-submit --jars thirdparty/sc-uber.jar
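A full submission also needs the main class and the application JAR; the class and JAR names below are placeholders for your own application:
spark-submit --master local[*] --class com.example.MyApp --jars thirdparty/sc-uber.jar myapp.jar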