Spark Study Notes (2): Programming Guide (Scala Edition)

Reference: http://spark.apache.org/docs/latest/programming-guide.html (From the RDD section on, these notes keep the guide's original English wording rather than a translation.)

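Summary: Every Spark application consists of a driver program that runs the main function and executes various parallel operations on a cluster. The RDD is the core abstraction of Spark. Besides RDDs, Spark's second abstraction is shared variables that can be used in parallel operations, of which there are two kinds: broadcast variables and accumulators.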

Spark's shells: bin/spark-shell (Scala); bin/pyspark (Python).

0.Linking with Spark => Initializing Spark => Programming => Submit

First create a SparkContext object, which tells Spark how to access a cluster. Before creating the SparkContext, build a SparkConf object that contains information about your application, as follows.

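import org.apache.spark.{SparkConf, SparkContext}

// appName and master are values supplied by your application.
val conf = new SparkConf().setAppName(appName).setMaster(master)

val sc = new SparkContext(conf)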

appName is the name your application shows in the cluster UI. master is a Spark, Mesos, or YARN cluster URL, or the special string "local" to run in local mode.

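PS: In the Spark shell, a special SparkContext has already been created for you, named sc. Creating your own SparkContext there will not work; instead, use the --master argument to choose which master the context connects to.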

Once you package your application into a JAR (for Java/Scala) or a set of .py or .zip files (for Python), the bin/spark-submit script lets you submit it to any supported cluster manager.

Spark is friendly to unit testing with any popular unit test framework. Simply create a SparkContext in your test with the master URL set to local, run your operations, and then call SparkContext.stop() to tear it down. Make sure you stop the context within a finally block or the test framework’s tearDown method, as Spark does not support two contexts running concurrently in the same program.
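
The testing advice above can be sketched as follows. This is a minimal sketch, assuming spark-core is on the classpath; the object name LocalContextCheck, the app name, and the use of plain assert instead of a test framework are illustrative choices, not part of the guide.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for a unit test: a local context, one operation, guaranteed stop.
object LocalContextCheck {
  def main(args: Array[String]): Unit = {
    // "local[2]" runs Spark inside this JVM with two worker threads.
    val conf = new SparkConf().setAppName("local-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val doubled = sc.parallelize(Seq(1, 2, 3)).map(_ * 2).collect()
      assert(doubled.sameElements(Array(2, 4, 6)))
    } finally {
      // Stop the context even if the assertion fails, so a later test can create a new one.
      sc.stop()
    }
  }
}
```

In a real test framework the body of the try block would sit in the test method and sc.stop() in its tearDown hook.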

1.RDD

1）Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Parallelized collections are created by calling SparkContext’s parallelize method on an existing collection in your driver program (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.

val data = Array(1, 2, 3, 4, 5)

val distData = sc.parallelize(data)

Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
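
To make the second parameter concrete, here is a small sketch, assuming spark-core is on the classpath and a local master; the object name PartitionCountSketch is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: request 10 partitions explicitly and check that Spark honours it.
object PartitionCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partition-sketch").setMaster("local[2]"))
    try {
      val data = Array(1, 2, 3, 4, 5)
      val distData = sc.parallelize(data, 10)
      // Ten partitions are created even though there are only five elements; some stay empty.
      assert(distData.partitions.length == 10)
      // Actions still see every element exactly once.
      assert(distData.reduce(_ + _) == 15)
    } finally {
      sc.stop()
    }
  }
}
```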

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

scala> val distFile = sc.textFile("data.txt")

distFile: RDD[String] = MappedRDD@1d4cee08

SparkContext’s textFile method takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc. URI) and reads it as a collection of lines. Once created, distFile can be acted on by dataset operations.

2）RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
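
One way to observe the laziness is to count how often the function passed to map actually runs. The sketch below assumes Spark 2.0 or later (for sc.longAccumulator, covered under Shared Variables) and a local master; the object name LazyEvaluationSketch is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: the map function does not run until an action asks for a result.
object LazyEvaluationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lazy-sketch").setMaster("local[2]"))
    try {
      val calls = sc.longAccumulator("map calls")
      val squared = sc.parallelize(1 to 5).map { x => calls.add(1); x * x }
      // Only the transformation has been recorded so far; nothing has executed.
      assert(calls.sum == 0L)
      // reduce is an action, so the map function now runs once per element.
      val total = squared.reduce(_ + _)
      assert(total == 55)
      assert(calls.sum == 5L)
    } finally {
      sc.stop()
    }
  }
}
```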

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory (persisting on disk or replicating across multiple nodes is also supported) using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. Storage levels include MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY, and others.
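
The difference between recomputing and persisting can be made visible by counting map invocations across two actions. This is a sketch under assumptions: Spark 2.0 or later (for sc.longAccumulator), a local master, and data small enough that the cached blocks are never evicted. The names PersistSketch and countedSquares are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.util.LongAccumulator

// Hypothetical example: compare map invocations with and without persist.
object PersistSketch {
  // Builds an RDD whose map function records every invocation in the given accumulator.
  def countedSquares(sc: SparkContext, calls: LongAccumulator): RDD[Int] =
    sc.parallelize(1 to 5).map { x => calls.add(1); x * x }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("persist-sketch").setMaster("local[2]"))
    try {
      val plainCalls = sc.longAccumulator("plain")
      val plain = countedSquares(sc, plainCalls)
      plain.count()
      plain.count()
      // Without persist, each action recomputes the map: 5 elements x 2 actions.
      assert(plainCalls.sum == 10L)

      val cachedCalls = sc.longAccumulator("cached")
      val cached = countedSquares(sc, cachedCalls).persist(StorageLevel.MEMORY_ONLY)
      cached.count()
      cached.count()
      // With persist, the second action reads the stored partitions instead.
      assert(cachedCalls.sum == 5L)
    } finally {
      sc.stop()
    }
  }
}
```

cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).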

Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are two recommended ways to do this:
- Anonymous function syntax, which can be used for short pieces of code.
- Static methods in a global singleton object. For example, you can define object MyFunctions and then pass MyFunctions.func1, as follows:

object MyFunctions {

  def func1(s: String): String = { ... }

}

myRdd.map(MyFunctions.func1)
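
Both recommended styles side by side, as a runnable sketch. It assumes spark-core on the classpath and a local master; TextFunctions, shout, and PassingFunctionsSketch are hypothetical names chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A global singleton object: its methods can be shipped to executors without dragging along
// an enclosing class instance.
object TextFunctions {
  def shout(s: String): String = s.toUpperCase + "!"
}

object PassingFunctionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("functions-sketch").setMaster("local[2]"))
    try {
      val words = sc.parallelize(Seq("spark", "rdd"))
      // Style 1: anonymous function syntax for short pieces of code.
      val lengths = words.map(s => s.length).collect()
      // Style 2: a method defined in a singleton object.
      val shouted = words.map(TextFunctions.shout).collect()
      assert(lengths.sameElements(Array(5, 3)))
      assert(shouted.sameElements(Array("SPARK!", "RDD!")))
    } finally {
      sc.stop()
    }
  }
}
```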

While most Spark operations work on RDDs containing any type of objects, a few special operations are only available on RDDs of key-value pairs. The most common ones are distributed “shuffle” operations, such as grouping or aggregating the elements by a key. In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the built-in tuples in the language, created by simply writing (a, b)), as long as you import org.apache.spark.SparkContext._ in your program to enable Spark’s implicit conversions. The key-value pair operations are available in the PairRDDFunctions class, which automatically wraps around an RDD of tuples if you import the conversions.

val lines = sc.textFile("data.txt")

val pairs = lines.map(s => (s, 1))

val counts = pairs.reduceByKey((a, b) => a + b)
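
The same counting pattern on an in-memory collection, so the result can be checked without a data.txt file. This sketch assumes Spark 1.3 or later, where the pair-RDD conversions apply without the extra import, and a local master; the object name LineCountSketch and the sample lines are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: count how many times each distinct line occurs.
object LineCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("line-count-sketch").setMaster("local[2]"))
    try {
      val lines = sc.parallelize(Seq("a", "b", "a"))
      val pairs = lines.map(s => (s, 1))
      // reduceByKey merges the values of each key with the given function.
      val counts = pairs.reduceByKey((a, b) => a + b)
      // collect() returns the pairs in no guaranteed order, so compare as a Map.
      assert(counts.collect().toMap == Map("a" -> 2, "b" -> 1))
    } finally {
      sc.stop()
    }
  }
}
```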

Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it is grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.

Operations which can cause a shuffle include repartition operations, ‘ByKey operations (except for counting), and join operations. To organize data for the shuffle, Spark generates sets of tasks: map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce.

Common transformations (see the documentation for details):

map, filter, flatMap, sample, union, intersection, distinct, groupByKey, reduceByKey, aggregateByKey, sortByKey, join, cartesian

Common actions:

reduce, collect, count, first, take, takeSample, takeOrdered, saveAsTextFile, countByKey, foreach

2.Shared Variables

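When a function passed to a Spark operation runs on a remote cluster node, it works on separate copies of the variables it uses, and updates to those copies are not propagated back to the driver program. Supporting general read-write shared variables across tasks would be inefficient, so Spark provides two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.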

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. A broadcast variable is created from a variable v by calling SparkContext.broadcast(v). After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once.

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))

broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value

res0: Array[Int] = Array(1, 2, 3)
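
A typical use is a small lookup table that every task needs. The sketch below assumes spark-core on the classpath and a local master; the object name BroadcastLookupSketch and the table contents are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: tasks read a broadcast lookup table instead of capturing the Map itself.
object BroadcastLookupSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]"))
    try {
      val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))
      val names = sc.parallelize(Seq(1, 2, 3))
        // Inside the closure, go through lookup.value rather than the original Map.
        .map(id => lookup.value.getOrElse(id, "unknown"))
        .collect()
      assert(names.sameElements(Array("one", "two", "unknown")))
    } finally {
      sc.stop()
    }
  }
}
```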

Accumulators are variables that are only “added” to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums.

scala> val accum = sc.accumulator(0, "My Accumulator")

accum: spark.Accumulator[Int] = 0

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

...

10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

scala> accum.value

res2: Int = 10
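
The transcript above uses sc.accumulator, which has been deprecated since Spark 2.0. A sketch of the same sum with the newer sc.longAccumulator follows, assuming Spark 2.0 or later and a local master; the object name AccumulatorSketch is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: sum the elements of an RDD into an accumulator read on the driver.
object AccumulatorSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("accumulator-sketch").setMaster("local[2]"))
    try {
      val accum = sc.longAccumulator("My Accumulator")
      // foreach is an action, so the updates are applied as soon as it returns.
      sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
      // Tasks can only add; the driver reads the merged result.
      assert(accum.sum == 10L)
    } finally {
      sc.stop()
    }
  }
}
```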

Next, read some examples. Strongly recommended: http://dongxicheng.org/framework-on-yarn/spark-scala-writing-application/

And a complete logistic regression (LR) example from the docs:

import org.apache.spark.SparkContext

import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionModel}

import org.apache.spark.mllib.evaluation.MulticlassMetrics

import org.apache.spark.mllib.regression.LabeledPoint

import org.apache.spark.mllib.linalg.Vectors

import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).

val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)

val training = splits(0).cache()

val test = splits(1)

// Run training algorithm to build the model

val model = new LogisticRegressionWithLBFGS()

  .setNumClasses(10)

  .run(training)

// Compute raw scores on the test set.

val predictionAndLabels = test.map { case LabeledPoint(label, features) =>

  val prediction = model.predict(features)

  (prediction, label)

}

// Get evaluation metrics.

val metrics = new MulticlassMetrics(predictionAndLabels)

val precision = metrics.precision

println("Precision = " + precision)

// Save and load model

model.save(sc, "myModelPath")

val sameModel = LogisticRegressionModel.load(sc, "myModelPath")
