Spark Shell

Example 1 - Process Data from a List

scala> val pairs = sc.parallelize( List(
("This", 2),
("is", 3),
("Spark", 5),
("is", 3)
) )
...
scala> pairs.collect().foreach(println)
(This,2)
(is,3)
(Spark,5)
(is,3)
// Reduce Pairs by Keys:
scala> val pair1 = pairs.reduceByKey((x,y) => x+y, 4)
...
scala> pair1.collect.foreach(println)
(Spark,5)
(is,6)
(This,2)
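// Note: the second argument of reduceByKey above (4) only sets the number of
// result partitions; omitting it gives the same sums (a quick check):
scala> pairs.reduceByKey( (x,y) => x+y ).collect.foreach(println)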
// Decrease values by 1:
scala> val pair2 = pairs.mapValues( x=>x-1 )
scala> pair2.collect.foreach(println)
(This,1)
(is,2)
(Spark,4)
(is,2)
// Group Values by Keys:
scala> pairs.groupByKey.collect().foreach(println)
(Spark,CompactBuffer(5))
(is,CompactBuffer(3, 3))
(This,CompactBuffer(2))
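// For comparison, the per-key sums computed above with reduceByKey can also be
// obtained from the grouped values (a sketch; reduceByKey is usually preferred
// because it combines values before shuffling):
scala> pairs.groupByKey.mapValues( _.sum ).collect.foreach(println)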

Example 2 - Process Data from Local Text File

// Create an RDD from a local text file:
scala> val textFile = sc.textFile("file:///home/PATH_TO_SPARK_HOME/README.md")

RDD transformations and actions can now be applied to textFile

// This will display the number of lines in this textFile:
scala> textFile.count()
// or simply:
scala> textFile.count
// Note: the parentheses can be omitted when a method takes no arguments
// This will display the first line:
scala> textFile.first
// Filter lines containing "Spark":
scala> val linesWithSpark = textFile.filter (
line => line.contains("Spark")
)
// or simply:
scala> val linesWithSpark = textFile.filter(_.contains ("Spark"))
// Note: the underscore "_" stands for each element (line) of textFile
// Collect the content of linesWithSpark:
scala> linesWithSpark.collect ()
// Print lines of content of linesWithSpark:
scala> linesWithSpark.foreach (println)
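// A small follow-up (not in the original): count how many lines matched:
scala> linesWithSpark.count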
// Map each line to #terms in it:
scala> val numOfTermsPerLine = textFile.map ( line => line.split(" ").size )
// or simply:
scala> val numOfTermsPerLine = textFile.map ( _.split(" ").size )
// Aggregate numOfTermsPerLine to find the max #terms:
scala> numOfTermsPerLine.reduce ( (a, b) => if (a > b) a else b )
// or use Math.max from java.lang:
scala> import java.lang.Math
scala> numOfTermsPerLine.reduce ( (a, b) => Math.max(a, b) )
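// Note: RDDs also provide max directly, so the same result can be obtained with
// (assuming the default Int ordering):
scala> numOfTermsPerLine.max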
// Convert RDD textFile to an 1-D array of terms:
scala> val terms = textFile.flatMap ( _.split(" ") ) // Convert RDD textFile to an 2-D array of lines of terms:
scala> val terms_ = textFile.map ( _.split(" ") )
// Calculate the vocabulary size in textFile:
scala> terms.distinct().count()
// or simply:
scala> terms.distinct.count
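// In practice you may want to normalize terms before counting the vocabulary,
// e.g. lowercasing and dropping empty strings (an optional refinement, not in
// the original):
scala> terms.map( _.toLowerCase ).filter( _.nonEmpty ).distinct.count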
// Find the longest line in textFile together with its length:
scala> val lineLengthPair = textFile.map (
line => (line, line.length) )
scala> val lineWithMaxLength = lineLengthPair.reduce (
(pair1, pair2) => if (pair1._2 >= pair2._2) pair1 else pair2 )
// alternatively, in a concise way:
scala> val lineWithMaxLength = textFile.map (
line => (line, line.length) ).reduce (
(pair1, pair2) => if (pair1._2 >= pair2._2) pair1 else pair2 )
// Find all lines containing "Spark" along with their line numbers (starting from 0)
// and output in the format <line_no: line_content>
scala> val lineIndexPair = textFile.zipWithIndex()
scala> val lineIndexPairWithSpark = lineIndexPair.filter (
_._1.contains("Spark") )
scala> lineIndexPairWithSpark.foreach (
pair => println ( pair._2 + ": " + pair._1 ) )
// alternatively, in a concise way:
scala> textFile.zipWithIndex().filter (
_._1.contains("Spark") ).foreach (
pair => println ( pair._2 + ": " + pair._1 ) )
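// zipWithIndex pairs each element with its Long index; a quick check on a toy
// RDD (illustration only, not part of the original example):
scala> sc.parallelize( Seq("a", "b", "c") ).zipWithIndex().collect()
// expected: Array((a,0), (b,1), (c,2))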

Example 3 - Process Data from Local CSV File

Download the CSV file by:

wget --content-disposition https://webcms3.cse.unsw.edu.au/files/cc5bb4af124130f899cddad80af071f1ad478c3c8eb7440433291459bb603ff1/attachment

Define a name-to-column-index mapping for the CSV file:

scala> val aucid = 0
scala> val bid = 1
scala> val bidtime = 2
scala> val bidder = 3
scala> val bidderrate = 4
scala> val openbid = 5
scala> val price = 6
scala> val itemtype = 7
scala> val dtl = 8
// Create an RDD as a 2-D array from CSV file:
scala> val auctionRDD = sc.textFile("file:///home/PATH-TO-CSV-FILE/auction.csv")
.map ( _.split(",") )
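// A quick sanity check (not in the original walkthrough): inspect the first
// parsed row, assuming the CSV file has no header line:
scala> auctionRDD.first.mkString(" | ")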
// Count the total number of distinct item types in the auction data:
scala> auctionRDD.map ( _(itemtype) ).distinct.count
// Note: itemtype was defined above as 7, indexing the 8th column
// Count the total number of bids per item type:
scala> auctionRDD.map ( line => ( line(itemtype), 1 ) )
.reduceByKey ( _ + _ , 4)
.foreach ( pair => println ( pair._1 + "," + pair._2 ) )
// Find the maximum number of bids over all auctions:
scala> auctionRDD.map ( line => ( line(aucid), 1 ) )
.reduceByKey ( _ + _ , 4)
.reduce ( (pair1, pair2) => if ( pair1._2 >= pair2._2 ) pair1 else pair2 )
._2
// Find the top-5 auctions with the most bids:
scala> auctionRDD.map ( line => (line(aucid), 1) )
.reduceByKey ( _ + _ , 4)
.map ( _.swap )
.sortByKey (false)
.map ( _.swap )
.take (5)
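// As an alternative to the swap / sortByKey / swap pattern, RDD.top with a
// custom Ordering returns the same top-5 pairs (a sketch, not from the original):
scala> auctionRDD.map ( line => (line(aucid), 1) )
.reduceByKey ( _ + _ , 4)
.top (5)(Ordering.by[(String, Int), Int](_._2))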

Example 4 - Word Count on HDFS Text File

Download the data file and put it into HDFS by:

wget --content-disposition https://webcms3.cse.unsw.edu.au/files/33c7707c8b646a686e33af7e2f2fc006b53ff8c13d8317976bd262d8c6daae66/attachment
hdfs dfs -put pg100.txt Input/
// Create an RDD from HDFS:
scala> val pg100RDD = sc.textFile ("hdfs://HOST-NAME:PORT/user/USER-NAME/Input/pg100.txt")
// Word count:
scala> pg100RDD.flatMap ( _.split(" ") )
.map ( term => (term, 1) )
.reduceByKey ( _ + _ , 3)
.saveAsTextFile ( "OUTPUT-PATH" )
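// saveAsTextFile writes one part-NNNNN file per partition (3 here) under
// OUTPUT-PATH; the result can be read back and inspected with (a quick check):
scala> sc.textFile ("OUTPUT-PATH").take (5).foreach ( println )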

Example N - Spark GraphX Programming

# Download graph data tiny-graph.txt
$ wget --content-disposition https://webcms3.cse.unsw.edu.au/files/ae6f45a3d64c0b35a3bd4d0c2740cc673f000dc60ec17d0e882faf6c20f74509/attachment
// Import the relevant GraphX classes:
scala> import org.apache.spark.graphx._
// Load graph data as RDD:
scala> val tinyGraphRDD = sc.textFile ("file:///home/PATH-TO-GRAPH-DATA/tiny-graph.txt")
// Convert raw data <index, srcVertex, destVertex, weight>
// into graphx readable edges:
scala> val edges = tinyGraphRDD.map ( _.split(" ") )
.map ( line =>
Edge ( line(1).toLong,
line(2).toLong,
line(3).toDouble
)
)
// Create a graph:
scala> val graph = Graph.fromEdges[Double, Double] (edges, 0.0)
// Now that the graph has been created,
// show the triplets of this graph:
scala> graph.triplets.collect.foreach ( println )
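// Other GraphX operators now apply as well; for example, the in-degree of each
// vertex (an extra check, not part of the original walkthrough):
scala> graph.inDegrees.collect.foreach ( println )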
