spark之scala程序开发(本地运行模式):单词出现次数统计
准备工作:
将运行Scala-Eclipse的机器节点(CloudDeskTop)内存调整至4G,因为需要在该节点上跑本地(local)Spark程序,本地Spark程序会启动Worker进程耗用大量内存资源
本地运行模式(主要用于调试)
1、首先将Spark的所有jar包拷贝到hadoop用户家目录下
[hadoop@CloudDeskTop spark-2.1.1]$ pwd
/software/spark-2.1.1
[hadoop@CloudDeskTop spark-2.1.1]$ cp -a jars /home/hadoop/bigdata/lib/Spark2.1.1-All
[hadoop@CloudDeskTop spark-2.1.1]$ ls ~/bigdata/lib
Hadoop2.7.3-All HBase1.2.6-All RPC Spark2.1.1-All
2、在Scala4.4的Eclipse版本中,新建一个Scala的工程
然后在Eclipse中创建一个Spark2.1.1-All的用户库,将~/Spark2.1.1-All目录中的所有jar包导入Spark2.1.1-All用户库中,并将此用户库添加到当前的Scala工程中:Window==>Preferences==>Java==>Build Path==>User Libraries==>New
3、在本地创建测试数据
[hadoop@CloudDeskTop scala]$ pwd
/home/hadoop/test/scala
[hadoop@CloudDeskTop scala]$ ls
information information02
[hadoop@CloudDeskTop scala]$ cat information
zhnag san shi yi ge hao ren
jin tian shi yi ge hao tian qi
wo zai zhe li zuo le yi ge ce shi
yi ge guan yu scala de ce shi
welcome to mmzs
欢迎 欢迎
[hadoop@CloudDeskTop scala]$ cat information02
zhnag san shi yi ge hao ren
jin tian shi yi ge hao tian qi
wo zai zhe li zuo le yi ge ce shi
yi ge guan yu scala de ce shi
welcome to mmzs
欢迎 欢迎
4、创建一个带main方法的入口类
File—>New—>Scala Object—>在弹出对话框的Name栏输入:com.mmzs.bigdata.spark.rdd.local.SparkRDD,其中SparkRDD是类名、com.mmzs.bigdata.spark.rdd.local是包名
object WordCount {
//建立一个main方法
def main(args: Array[String]): Unit = {
......
}
}
完整的代码如下:
package com.mmzs.bigdata.spark.rdd.local import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD.rddToOrderedRDDFunctions
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions object SparkRDD {
def main(args: Array[String]): Unit = {
//读取spark配置文件
val conf:SparkConf = new SparkConf();
//本地测试模式
conf.setMaster("local");//Local Scala Spark RDD
//设定应用名字
conf.setAppName("sparkTestLocal"); //根据配置获取spark上下文对象
val sc:SparkContext = new SparkContext(conf); //使用sparkContext读取文件到内存并生成RDD对象,
//指定一个输入目录即可,目录中的所有文件都将作为输入文件
val lineRdd:RDD[String] = sc.textFile("/home/hadoop/test/scala");
//使用空格切分每一行的数据为单词数组,并将单词数组中的单词子串释放到外层的RDD集合中
val flatRdd:RDD[String] = lineRdd.flatMap { line => line.split(" "); }
//将RDD中的每一个单词字串转化为元组,以完成单词计数
val mapRDD:RDD[Tuple2[String, Integer]] = flatRdd.map(word=>(word,1));
//按RDD集合中每一个元组的第一个元素(即单词字串)进行分组并完成单词计数
val reduceRDD:RDD[(String, Integer)] = mapRDD.reduceByKey((pre:Integer, next:Integer)=>pre+next);
//交换元素中的key和value的位置便于后续排序
val reduceRDD02:RDD[(Integer, String)] = reduceRDD.map(tuple=>(tuple._2,tuple._1));
//根据key进行排序,第二个参数表示启动的Task数量,设置大了可能会抛出内存溢出的异常
val sortRDD:RDD[(Integer, String)] = reduceRDD02.sortByKey(false, 1);
//排好序之后将顺序换回来
val sortRDD02:RDD[(Integer, String)] = reduceRDD.map(tuple=>(tuple._2,tuple._1)); //指定一个输出目录,如果输出目录已经事先存在则应该将它删除掉
//val f:File=new File("/home/hadoop/test/scala/output");
//if(f.exists()&&f.isDirectory()) f.delete();
//将结果保存到指定路径下
reduceRDD.saveAsTextFile("/home/hadoop/test/scala/output/"); //停止使用spark上下文对象
sc.stop();
}
}
5、运行结果如下:
[hadoop@CloudDeskTop scala]$ ll
总用量 8
-rw-rw-r-- 1 hadoop hadoop 156 2月 7 15:27 information
-rw-rw-r-- 1 hadoop hadoop 156 2月 7 15:29 information02
[hadoop@CloudDeskTop scala]$ ll
总用量 12
-rw-rw-r-- 1 hadoop hadoop 156 2月 7 15:27 information
-rw-rw-r-- 1 hadoop hadoop 156 2月 7 15:29 information02
drwxrwxr-x 2 hadoop hadoop 4096 2月 7 15:54 output
[hadoop@CloudDeskTop scala]$ cd output/
[hadoop@CloudDeskTop output]$ ls
part-00000 part-00001 _SUCCESS
[hadoop@CloudDeskTop output]$ cat part-00000
(scala,2)
(zuo,2)
(tian,4)
(shi,8)
(ce,4)
(zai,2)
(欢迎,4)
(wo,2)
(zhnag,2)
(san,2)
(welcome,2)
(yi,8)
(ge,8)
(hao,4)
(qi,2)
(yu,2)
[hadoop@CloudDeskTop output]$ cat part-00001
(guan,2)
(jin,2)
(ren,2)
(de,2)
(le,2)
(to,2)
(zhe,2)
(li,2)
(mmzs,2)
在Scala4.4的Eclipse的控制台显示信息如下:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/02/07 15:54:12 INFO SparkContext: Running Spark version 2.1.1
18/02/07 15:54:12 WARN SparkContext: Support for Java 7 is deprecated as of Spark 2.0.0
18/02/07 15:54:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/02/07 15:54:14 INFO SecurityManager: Changing view acls to: hadoop
18/02/07 15:54:14 INFO SecurityManager: Changing modify acls to: hadoop
18/02/07 15:54:14 INFO SecurityManager: Changing view acls groups to:
18/02/07 15:54:14 INFO SecurityManager: Changing modify acls groups to:
18/02/07 15:54:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
18/02/07 15:54:14 INFO Utils: Successfully started service 'sparkDriver' on port 37639.
18/02/07 15:54:14 INFO SparkEnv: Registering MapOutputTracker
18/02/07 15:54:14 INFO SparkEnv: Registering BlockManagerMaster
18/02/07 15:54:14 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/02/07 15:54:14 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/02/07 15:54:14 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-b7a711cf-321e-43af-95ca-c504982e56e4
18/02/07 15:54:14 INFO MemoryStore: MemoryStore started with capacity 348.0 MB
18/02/07 15:54:15 INFO SparkEnv: Registering OutputCommitCoordinator
18/02/07 15:54:15 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/02/07 15:54:15 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.154.134:4040
18/02/07 15:54:15 INFO Executor: Starting executor ID driver on host localhost
18/02/07 15:54:15 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59713.
18/02/07 15:54:15 INFO NettyBlockTransferService: Server created on 192.168.154.134:59713
18/02/07 15:54:15 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/02/07 15:54:15 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.154.134, 59713, None)
18/02/07 15:54:15 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.154.134:59713 with 348.0 MB RAM, BlockManagerId(driver, 192.168.154.134, 59713, None)
18/02/07 15:54:15 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.154.134, 59713, None)
18/02/07 15:54:15 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.154.134, 59713, None)
18/02/07 15:54:17 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 193.9 KB, free 347.8 MB)
18/02/07 15:54:17 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.9 KB, free 347.8 MB)
18/02/07 15:54:17 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.154.134:59713 (size: 22.9 KB, free: 348.0 MB)
18/02/07 15:54:17 INFO SparkContext: Created broadcast 0 from textFile at SparkRDD.scala:24
18/02/07 15:54:18 INFO FileInputFormat: Total input paths to process : 2
18/02/07 15:54:18 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
18/02/07 15:54:18 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
18/02/07 15:54:18 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
18/02/07 15:54:18 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
18/02/07 15:54:18 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
18/02/07 15:54:18 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
18/02/07 15:54:18 INFO SparkContext: Starting job: saveAsTextFile at SparkRDD.scala:42
18/02/07 15:54:18 INFO DAGScheduler: Registering RDD 3 (map at SparkRDD.scala:28)
18/02/07 15:54:18 INFO DAGScheduler: Got job 0 (saveAsTextFile at SparkRDD.scala:42) with 2 output partitions
18/02/07 15:54:18 INFO DAGScheduler: Final stage: ResultStage 1 (saveAsTextFile at SparkRDD.scala:42)
18/02/07 15:54:18 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18/02/07 15:54:18 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
18/02/07 15:54:18 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at SparkRDD.scala:28), which has no missing parents
18/02/07 15:54:18 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.4 KB, free 347.8 MB)
18/02/07 15:54:18 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.5 KB, free 347.8 MB)
18/02/07 15:54:18 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.154.134:59713 (size: 2.5 KB, free: 348.0 MB)
18/02/07 15:54:18 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:996
18/02/07 15:54:18 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at SparkRDD.scala:28)
18/02/07 15:54:18 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
18/02/07 15:54:18 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 5984 bytes)
18/02/07 15:54:18 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/02/07 15:54:18 INFO HadoopRDD: Input split: file:/home/hadoop/test/scala/information:0+156
18/02/07 15:54:19 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1748 bytes result sent to driver
18/02/07 15:54:19 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 5986 bytes)
18/02/07 15:54:19 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
18/02/07 15:54:19 INFO HadoopRDD: Input split: file:/home/hadoop/test/scala/information02:0+156
18/02/07 15:54:19 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 527 ms on localhost (executor driver) (1/2)
18/02/07 15:54:19 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1661 bytes result sent to driver
18/02/07 15:54:19 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 137 ms on localhost (executor driver) (2/2)
18/02/07 15:54:19 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/02/07 15:54:19 INFO DAGScheduler: ShuffleMapStage 0 (map at SparkRDD.scala:28) finished in 0.649 s
18/02/07 15:54:19 INFO DAGScheduler: looking for newly runnable stages
18/02/07 15:54:19 INFO DAGScheduler: running: Set()
18/02/07 15:54:19 INFO DAGScheduler: waiting: Set(ResultStage 1)
18/02/07 15:54:19 INFO DAGScheduler: failed: Set()
18/02/07 15:54:19 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[8] at saveAsTextFile at SparkRDD.scala:42), which has no missing parents
18/02/07 15:54:19 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 72.4 KB, free 347.7 MB)
18/02/07 15:54:19 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 26.1 KB, free 347.7 MB)
18/02/07 15:54:19 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.154.134:59713 (size: 26.1 KB, free: 347.9 MB)
18/02/07 15:54:19 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:996
18/02/07 15:54:19 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[8] at saveAsTextFile at SparkRDD.scala:42)
18/02/07 15:54:19 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
18/02/07 15:54:19 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, executor driver, partition 0, ANY, 5757 bytes)
18/02/07 15:54:19 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
18/02/07 15:54:19 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
18/02/07 15:54:19 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 11 ms
18/02/07 15:54:19 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
18/02/07 15:54:19 INFO FileOutputCommitter: Saved output of task 'attempt_20180207155418_0001_m_000000_2' to file:/home/hadoop/test/scala/output/_temporary/0/task_20180207155418_0001_m_000000
18/02/07 15:54:19 INFO SparkHadoopMapRedUtil: attempt_20180207155418_0001_m_000000_2: Committed
18/02/07 15:54:19 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1977 bytes result sent to driver
18/02/07 15:54:19 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, executor driver, partition 1, ANY, 5757 bytes)
18/02/07 15:54:19 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
18/02/07 15:54:19 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 298 ms on localhost (executor driver) (1/2)
18/02/07 15:54:19 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
18/02/07 15:54:19 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/02/07 15:54:19 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
18/02/07 15:54:19 INFO FileOutputCommitter: Saved output of task 'attempt_20180207155418_0001_m_000001_3' to file:/home/hadoop/test/scala/output/_temporary/0/task_20180207155418_0001_m_000001
18/02/07 15:54:19 INFO SparkHadoopMapRedUtil: attempt_20180207155418_0001_m_000001_3: Committed
18/02/07 15:54:19 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1890 bytes result sent to driver
18/02/07 15:54:19 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at SparkRDD.scala:42) finished in 0.437 s
18/02/07 15:54:19 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 143 ms on localhost (executor driver) (2/2)
18/02/07 15:54:19 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
18/02/07 15:54:19 INFO DAGScheduler: Job 0 finished: saveAsTextFile at SparkRDD.scala:42, took 1.566889 s
18/02/07 15:54:19 INFO SparkUI: Stopped Spark web UI at http://192.168.154.134:4040
18/02/07 15:54:19 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/02/07 15:54:19 INFO MemoryStore: MemoryStore cleared
18/02/07 15:54:19 INFO BlockManager: BlockManager stopped
18/02/07 15:54:19 INFO BlockManagerMaster: BlockManagerMaster stopped
18/02/07 15:54:19 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/02/07 15:54:19 INFO SparkContext: Successfully stopped SparkContext
18/02/07 15:54:19 INFO ShutdownHookManager: Shutdown hook called
18/02/07 15:54:19 INFO ShutdownHookManager: Deleting directory /tmp/spark-1ce79d12-6eb7-403c-b9e8-3d836d821c90
Console界面显示的信息:
6、说明
如果master主机设置为local则表示本地运行,本地运行时Spark会启动一个集群模拟器来运行Job作业,本地运行只能在Eclipse、或直接使用java命令等运行它,不能使用spark-submit来提交运行,因为该命令会装载Spark环境配置并连接到Spark集群,对于本地模式下是不需要连接集群的,本地模式仅仅适合于在本地开发环境测试代码。
使用scala命令运行本地模式:
Scala使用命令行方式运行jar包还存在一些bug(不能使用java命令来运行scala开发的jar包),主要体现在无法协调处理类路径classpath的导入问题
spark之scala程序开发(本地运行模式):单词出现次数统计的更多相关文章
- 2 weekend110的mapreduce介绍及wordcount + wordcount的编写和提交集群运行 + mr程序的本地运行模式
把我们的简单运算逻辑,很方便地扩展到海量数据的场景下,分布式运算. Map作一些,数据的局部处理和打散工作. Reduce作一些,数据的汇总工作. 这是之前的,weekend110的hdfs输入流之源 ...
- spark之scala程序开发(集群运行模式):单词出现次数统计
准备工作: 将运行Scala-Eclipse的机器节点(CloudDeskTop)内存调整至4G,因为需要在该节点上跑本地(local)Spark程序,本地Spark程序会启动Worker进程耗用大量 ...
- spark之java程序开发
spark之java程序开发 1.Spark中的Java开发的缘由: Spark自身是使用Scala程序开发的,Scala语言是同时具备函数式编程和指令式编程的一种混血语言,而Spark源码是基于Sc ...
- Spark On Yarn搭建及各运行模式说明
之前记录Yarn:Hadoop2.0之YARN组件,这次使用Docker搭建Spark On Yarn 一.各运行模式 1.单机模式 该模式被称为Local[N]模式,是用单机的多个线程来模拟Spa ...
- scala程序开发入门
scala程序开发入门,快速步入scala的门槛: 1.Scala的特性: A.纯粹面向对象(没有基本类型,只有对象类型).Scala的安装与JDK相同,只需要解压之后配置环境变量即可:B.Scala ...
- hadoop本地运行模式调试
一:简介 最近学习hadoop本地运行模式,在运行期间遇到一些问题,记录下来备用:以运行hadoop下wordcount为例子. hadoop程序是在集群运行还是在本地运行取决于下面两个参数的设置,第 ...
- Spark on YARN的两种运行模式
Spark on YARN有两种运行模式,如下 1.yarn-cluster:适合于生产环境. Spark的Driver运行在ApplicationMaster中,它负责向YARN Re ...
- Spark Client和Cluster两种运行模式的工作流程
1.client mode: In client mode, the driver is launched in the same process as the client that submits ...
- Storm的本地运行模式示例
以word count为例,本地化运行模式(不需要安装zookeeper.storm集群),maven工程, pom.xml文件如下: <project xmlns="http://m ...
随机推荐
- 【转】RPC介绍
转自:http://www.cnblogs.com/Vincentlu/p/4185299.html 摘要: RPC——Remote Procedure Call Protocol,这是广义上的解释, ...
- Flask框架(一)
from flask import Flask app = Flask(__name__) @app.route('/') def index(): return '<h1>hello w ...
- 写在HTTP协议之前
1.网络模型 OSI模型即:开放系统互连参考模型(Open System Interconnect 简称OSI)是国际标准化组织(ISO)和国际电报电话咨询委员会(CCITT)联合制定的开放系统互连参 ...
- 微信团队分享:Kotlin渐被认可,Android版微信的技术尝鲜之旅
本文由微信开发团队工程是由“oneliang”原创发表于WeMobileDev公众号,内容稍有改动. 1.引言 Kotlin 是一个用于现代多平台应用的静态编程语言,由 JetBrains 开发( ...
- 吴恩达机器学习笔记46-K-均值算法(K-Means Algorithm)
K-均值是最普及的聚类算法,算法接受一个未标记的数据集,然后将数据聚类成不同的组. K-均值是一个迭代算法,假设我们想要将数据聚类成n 个组,其方法为: 首先选择
- HTML学习一_网页的基本结构及HTML简介
HTML网页的基本结构 ```angular2html<!DOCTYPE html> 声明为 HTML5 文档<html> 元素是 HTML 页面的根元素<head> ...
- LabVIEW(九):程序结构中的分支结构和顺序结构
一.分支结构 1.创建分支结构:程序框图右键>结构>条件结构 2.Ctrl + I 会显示错误列表,双击错误列表会定位到该错误在程序框图中地方. 3.有的分支可以不连接分支内容. 在不连接 ...
- java后端服务器读取excel将数据导入数据库
使用的是easypoi,官网文档:http://easypoi.mydoc.io/ /** * 导入Excel文件 */ @PostMapping("/importTeacher" ...
- TypeScript基础类型,类实例和函数类型声明
TypeScript(TS)是微软研发的编程语言,是JavaScript的超集,也就是在JavaScript的基础上添加了一些特性.其中之一就是类型声明. 一.基础类型 TS的基础类型有 Boolea ...
- eos开发(一) eos开发环境搭建
区块链最近挺火的,我又是个非常缺钱的人,所以紧跟了潮流一头扎进区块链的研究中. 这EOS项目是目前比较火的一个项目,相信很多朋友拿到这份EOS的源代码后都会一脸懵逼,因为……这代码写得太高级了,老纸看 ...