使用Scala编写Spark程序求基站下移动用户停留时长TopN
使用Scala编写Spark程序求基站下移动用户停留时长TopN
1. 需求:根据手机基站日志计算停留时长的TopN
我们的手机之所以能够实现移动通信,是因为在全国各地有许许多多的基站,只要手机一开机,就会和附近的基站尝试建立连接,而每一次建立连接和断开连接都会被记录到移动运营商的基站服务器的日志中。
虽然我们不知道手机用户所在的具体位置,但是根据基站的位置就可以大致判断手机用户的所处的地理范围,然后商家就可以根据用户的位置信息来做一些推荐广告。
为了便于理解,我们简单模拟了基站上的一些移动用户日志数据,共4个字段:手机号码,时间戳,基站id,连接类型(1表示建立连接,0表示断开连接)
基站A的用户日志(19735E1C66.log文件):
18688888888,20160327082400,16030401EAFB68F1E3CDF819735E1C66,1
18611132889,20160327082500,16030401EAFB68F1E3CDF819735E1C66,1
18688888888,20160327170000,16030401EAFB68F1E3CDF819735E1C66,0
18611132889,20160327180000,16030401EAFB68F1E3CDF819735E1C66,0
基站B的用户日志(DDE7970F68.log文件):
18611132889,20160327075000,9F36407EAD0629FC166F14DDE7970F68,1
18688888888,20160327075100,9F36407EAD0629FC166F14DDE7970F68,1
18611132889,20160327081000,9F36407EAD0629FC166F14DDE7970F68,0
18688888888,20160327081300,9F36407EAD0629FC166F14DDE7970F68,0
18688888888,20160327175000,9F36407EAD0629FC166F14DDE7970F68,1
18611132889,20160327182000,9F36407EAD0629FC166F14DDE7970F68,1
18688888888,20160327220000,9F36407EAD0629FC166F14DDE7970F68,0
18611132889,20160327230000,9F36407EAD0629FC166F14DDE7970F68,0
基站C的用户日志(E549D940E0.log文件):
18611132889,20160327081100,CC0710CC94ECC657A8561DE549D940E0,1
18688888888,20160327081200,CC0710CC94ECC657A8561DE549D940E0,1
18688888888,20160327081900,CC0710CC94ECC657A8561DE549D940E0,0
18611132889,20160327082000,CC0710CC94ECC657A8561DE549D940E0,0
18688888888,20160327171000,CC0710CC94ECC657A8561DE549D940E0,1
18688888888,20160327171600,CC0710CC94ECC657A8561DE549D940E0,0
18611132889,20160327180500,CC0710CC94ECC657A8561DE549D940E0,1
18611132889,20160327181500,CC0710CC94ECC657A8561DE549D940E0,0
下面是基站表的数据(loc_info.txt文件),共4个字段,分别代表基站id和经纬度以及信号的辐射类型(信号级别,比如2G信号、3G信号和4G信号等):
9F36407EAD0629FC166F14DDE7970F68,116.304864,40.050645,6
CC0710CC94ECC657A8561DE549D940E0,116.303955,40.041935,6
16030401EAFB68F1E3CDF819735E1C66,116.296302,40.032296,6
基于以上数据,要求计算每个手机号码在每个基站停留时间最长的2个地点(经纬度)。
思路:
(1)读取日志数据,并切分字段;
(2)整理字段,以手机号码和基站id为key,时间为value封装成Tuple;
(3)根据key进行聚合,将时间累加;
(4)将数据以基站id为key,以手机号码和时间为value封装成Tuple,便于后面和基站表进行join;
(5)读取基站数据,并切分字段;
(6)整理字段,以基站id为key,以基站的经度和纬度为value封装到Tuple;
(7)将两个Tuple进行join;
(8)对join后的结果按照手机号码分组;
(9)将分组后的结果转成List,在按照时间排序,在反转,最后取Top2;
(10)将计算结果写入HDFS;
2. 准备测试数据
1) 移动用户的日志信息(即上面的3个log文件):
2) 基站数据(即上面的loc_info.txt文件):
3. pom文件
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion> <groupId>com.xuebusi</groupId>
<artifactId>spark</artifactId>
<version>1.0-SNAPSHOT</version> <properties>
<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>
<encoding>UTF-8</encoding> <!-- 这里对jar包版本做集中管理 -->
<scala.version>2.10.6</scala.version>
<spark.version>1.6.2</spark.version>
<hadoop.version>2.6.4</hadoop.version>
</properties> <dependencies>
<dependency>
<!-- scala语言核心包 -->
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<!-- spark核心包 -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>${spark.version}</version>
</dependency> <dependency>
<!-- hadoop的客户端,用于访问HDFS -->
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
</dependencies> <build>
<pluginManagement> <plugins>
<!-- scala-maven-plugin:编译scala程序的Maven插件 -->
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
</plugin>
<!-- maven-compiler-plugin:编译java程序的Maven插件 -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.5.1</version>
</plugin>
</plugins>
</pluginManagement>
<plugins>
<!-- 编译scala程序的Maven插件的一些配置参数 -->
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<executions>
<execution>
<id>scala-compile-first</id>
<phase>process-resources</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>scala-test-compile</id>
<phase>process-test-resources</phase>
<goals>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<!-- 编译java程序的Maven插件的一些配置参数 -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<executions>
<execution>
<phase>compile</phase>
<goals>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
<!-- maven-shade-plugin:打jar包用的Mavne插件 -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build> </project>
4. 编写Scala程序
在src/main/scala/目录下创建一个名为MobileLocation的Object:
MobileLocation.scala完整代码:
package com.xuebusi.spark import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext} /**
* 计算每个基站下停留时间最长的2个手机号
* Created by SYJ on 2017/1/24.
*/
object MobileLocation { def main(args: Array[String]) {
/**
* 创建SparkConf
*
* 一些说明:
* 为了便于在IDEA中进行Debug测试,
* 这里就设置为local模式,即在本地运行Spark程序;
* 但是这种方式存在一个问题,如果要从HDFS中读数据,
* 在Windows平台下读取Linux上HDFS中的数据的话,
* 可能会抛异常,因为它在读取数据的时候要用到Windows
* 下的一些本地库;
*
* 在使用Eclipse在Windows上运行MapReduce程序的时候也会遇到
* 该问题,但是在Linux和MacOS操作系统中则不会遇到这种问题.
*
* Hadoop的压缩和解压缩要用到Windows的一些本地库,
* 而这些库是C或者C++编写的,而C和C++编写的库文件是不跨平台的,
* 所以要想在Windows下调试MapReduce程序需要先安装好本地库;
*
* 建议在Windows下安装Linux虚拟机,带有图形界面的,
* 这样调试就不会有问题.
*
*/
//本地运行
val conf: SparkConf = new SparkConf().setAppName("MobileLocation").setMaster("local")
//创建SparkConf,默认以集群方式运行
//val conf: SparkConf = new SparkConf().setAppName("MobileLocation") //创建SparkContext
val sc: SparkContext = new SparkContext(conf) //从文件系统读取数据
val lines: RDD[String] = sc.textFile(args(0)) /**
* 切分数据
* 这里使用了两个map方法,不建议使用这种方式,
* 我们可以在一个map方法中完成
*/
//lines.map(_.split(",")).map(arr => (arr(0), arr(1).toLong, arr(2), arr(3))) //在一个map方法中实现对数据的切分,并组装成元组的形式
val splited = lines.map(line => {
val fields: Array[String] = line.split(",")
val mobile: String = fields(0)
//val time: Long = fields(1).toLong
val lac: String = fields(2)
val tp: String = fields(3)
val time: Long = if(tp == "1") -fields(1).toLong else fields(1).toLong //将字段拼接成元组再返回
//((手机号码, 基站id), 时间)
((mobile, lac), time)
}) //分组聚合
val reduced: RDD[((String, String), Long)] = splited.reduceByKey(_+_) //整理成元组格式,便于下一步和基站表进行join
val lacAndMobieTime = reduced.map(x => {
//(基站id, (手机号码, 时间))
(x._1._2, (x._1._1, x._2))
}) //读取基站数据
val lacInfo: RDD[String] = sc.textFile(args(1)) //切分数据并jion
val splitedLacInfo = lacInfo.map(line => {
val fields: Array[String] = line.split(",")
val id: String = fields(0)//基站id
val x: String = fields(1)//基站经度
val y: String = fields(2)//基站纬度
/**
* 返回数据
* 只有key-value类型的数据才可以进行join,
* 所以这里返回元组,以基站id为key,
* 以基站的经纬度为value:
* (基站id, (经度, 纬度))
*/
(id, (x, y))
}) //join
//返回:RDD[(基站id, ((手机号码, 时间), (经度, 纬度)))]
val joined: RDD[(String, ((String, Long), (String, String)))] = lacAndMobieTime.join(splitedLacInfo)
//ArrayBuffer((CC0710CC94ECC657A8561DE549D940E0,((18688888888,1300),(116.303955,40.041935))), (CC0710CC94ECC657A8561DE549D940E0,((18611132889,1900),(116.303955,40.041935))), (16030401EAFB68F1E3CDF819735E1C66,((18688888888,87600),(116.296302,40.032296))), (16030401EAFB68F1E3CDF819735E1C66,((18611132889,97500),(116.296302,40.032296))), (9F36407EAD0629FC166F14DDE7970F68,((18611132889,54000),(116.304864,40.050645))), (9F36407EAD0629FC166F14DDE7970F68,((18688888888,51200),(116.304864,40.050645))))
//System.out.println(joined.collect().toBuffer) //按手机号码分组
val groupedByMobile = joined.groupBy(_._2._1._1)
//ArrayBuffer((18688888888,CompactBuffer((CC0710CC94ECC657A8561DE549D940E0,((18688888888,1300),(116.303955,40.041935))), (16030401EAFB68F1E3CDF819735E1C66,((18688888888,87600),(116.296302,40.032296))), (9F36407EAD0629FC166F14DDE7970F68,((18688888888,51200),(116.304864,40.050645))))), (18611132889,CompactBuffer((CC0710CC94ECC657A8561DE549D940E0,((18611132889,1900),(116.303955,40.041935))), (16030401EAFB68F1E3CDF819735E1C66,((18611132889,97500),(116.296302,40.032296))), (9F36407EAD0629FC166F14DDE7970F68,((18611132889,54000),(116.304864,40.050645))))))
//System.out.println(groupedByMobile.collect().toBuffer) /**
* 先转成List,再按照时间排序,再反转元素,再取Top2
*/
val result = groupedByMobile.mapValues(_.toList.sortBy(_._2._1._2).reverse.take(2))
//ArrayBuffer((18688888888,List((16030401EAFB68F1E3CDF819735E1C66,((18688888888,87600),(116.296302,40.032296))), (9F36407EAD0629FC166F14DDE7970F68,((18688888888,51200),(116.304864,40.050645))))), (18611132889,List((16030401EAFB68F1E3CDF819735E1C66,((18611132889,97500),(116.296302,40.032296))), (9F36407EAD0629FC166F14DDE7970F68,((18611132889,54000),(116.304864,40.050645))))))
//System.out.println(result.collect().toBuffer) //将结果写入到文件系统
result.saveAsTextFile(args(2)) //释放资源
sc.stop()
}
}
5. 本地测试
编辑配置信息:
点击“+”号,添加一个配置:
选择“Application”:
选择Main方法所在的类:
填写配置的名称,在Program arguments输入框中填写3个参数,分别为两个输入目录和一个输出目录:
在本地运行程序:
抛出一个内存不足异常:
17/01/24 17:17:58 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: System memory 259522560 must be at least 4.718592E8. Please use a larger heap size.
at org.apache.spark.memory.UnifiedMemoryManager$.getMaxMemory(UnifiedMemoryManager.scala:198)
at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:180)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:354)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:193)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:288)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:457)
at com.xuebusi.spark.MobileLocation$.main(MobileLocation.scala:37)
at com.xuebusi.spark.MobileLocation.main(MobileLocation.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
可以程序中添加一行代码来解决:
conf.set("spark.testing.memory", "536870912")//后面的值大于512m即可
但是上面的方式不够灵活,这里我们采用给JVM传参的方式。修改配置,在VM options输入框中添加参数“-Xmx512m”或者“-Dspark.testing.memory=536870912”,内存大小为512M:
程序输出日志以及结果:
6. 提交到Spark集群上运行
本地运行和提交到集群上运行,在代码上有所区别,需要修改一行代码:
将编写好的程序使用Maven插件打成jar包:
将jar包上传到Spark集群服务器:
将测试所用的数据也上传到HDFS集群:
运行命令:
/root/apps/spark/bin/spark-submit \
--master spark://hadoop01:7077,hadoop02:7077 \
--executor-memory 512m \
--total-executor-cores 7 \
--class com.xuebusi.spark.MobileLocation \
/root/spark-1.0-SNAPSHOT.jar \
hdfs://hadoop01:9000/mobile/input/mobile_logs \
hdfs://hadoop01:9000/mobile/input/loc_logs \
hdfs://hadoop01:9000/mobile/output
7. 程序运行时的输出日志
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
// :: INFO SparkContext: Running Spark version 1.6.
// :: WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
// :: INFO SecurityManager: Changing view acls to: root
// :: INFO SecurityManager: Changing modify acls to: root
// :: INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
// :: INFO Utils: Successfully started service 'sparkDriver' on port .
// :: INFO Slf4jLogger: Slf4jLogger started
// :: INFO Remoting: Starting remoting
// :: INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.71.11:49399]
// :: INFO Utils: Successfully started service 'sparkDriverActorSystem' on port .
// :: INFO SparkEnv: Registering MapOutputTracker
// :: INFO SparkEnv: Registering BlockManagerMaster
// :: INFO DiskBlockManager: Created local directory at /tmp/blockmgr-6c4988a1-3ad2-49ad-8cc2-02390384792b
// :: INFO MemoryStore: MemoryStore started with capacity 517.4 MB
// :: INFO SparkEnv: Registering OutputCommitCoordinator
// :: INFO Utils: Successfully started service 'SparkUI' on port .
// :: INFO SparkUI: Started SparkUI at http://192.168.71.11:4040
// :: INFO HttpFileServer: HTTP File server directory is /tmp/spark-eaa76419-9ddc-43c1-ad0e-cc95c0dede7e/httpd--8d9c-4ea7-9fca-5caf8c712f86
// :: INFO HttpServer: Starting HTTP Server
// :: INFO Utils: Successfully started service 'HTTP file server' on port .
// :: INFO SparkContext: Added JAR file:/root/spark-1.0-SNAPSHOT.jar at http://192.168.71.11:58135/jars/spark-1.0-SNAPSHOT.jar with timestamp 1485286086076
// :: INFO Executor: Starting executor ID driver on host localhost
// :: INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port .
// :: INFO NettyBlockTransferService: Server created on
// :: INFO BlockManagerMaster: Trying to register BlockManager
// :: INFO BlockManagerMasterEndpoint: Registering block manager localhost: with 517.4 MB RAM, BlockManagerId(driver, localhost, )
// :: INFO BlockManagerMaster: Registered BlockManager
// :: INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 153.6 KB, free 153.6 KB)
// :: INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 167.5 KB)
// :: INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost: (size: 13.9 KB, free: 517.4 MB)
// :: INFO SparkContext: Created broadcast from textFile at MobileLocation.scala:
// :: INFO FileInputFormat: Total input paths to process :
// :: INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 86.4 KB, free 253.9 KB)
// :: INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 19.3 KB, free 273.2 KB)
// :: INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost: (size: 19.3 KB, free: 517.4 MB)
// :: INFO SparkContext: Created broadcast from textFile at MobileLocation.scala:
// :: INFO FileInputFormat: Total input paths to process :
// :: INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
// :: INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
// :: INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
// :: INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
// :: INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
// :: INFO SparkContext: Starting job: saveAsTextFile at MobileLocation.scala:
// :: INFO DAGScheduler: Registering RDD (map at MobileLocation.scala:)
// :: INFO DAGScheduler: Registering RDD (map at MobileLocation.scala:)
// :: INFO DAGScheduler: Registering RDD (map at MobileLocation.scala:)
// :: INFO DAGScheduler: Registering RDD (groupBy at MobileLocation.scala:)
// :: INFO DAGScheduler: Got job (saveAsTextFile at MobileLocation.scala:) with output partitions
// :: INFO DAGScheduler: Final stage: ResultStage (saveAsTextFile at MobileLocation.scala:)
// :: INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage )
// :: INFO DAGScheduler: Missing parents: List(ShuffleMapStage )
// :: INFO DAGScheduler: Submitting ShuffleMapStage (MapPartitionsRDD[] at map at MobileLocation.scala:), which has no missing parents
// :: INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.8 KB, free 277.0 KB)
// :: INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.2 KB, free 279.2 KB)
// :: INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost: (size: 2.2 KB, free: 517.4 MB)
// :: INFO SparkContext: Created broadcast from broadcast at DAGScheduler.scala:
// :: INFO DAGScheduler: Submitting missing tasks from ShuffleMapStage (MapPartitionsRDD[] at map at MobileLocation.scala:)
// :: INFO TaskSchedulerImpl: Adding task set 1.0 with tasks
// :: INFO DAGScheduler: Submitting ShuffleMapStage (MapPartitionsRDD[] at map at MobileLocation.scala:), which has no missing parents
// :: INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.9 KB, free 283.1 KB)
// :: INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.2 KB, free 285.3 KB)
// :: INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID , localhost, partition ,ANY, bytes)
// :: INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost: (size: 2.2 KB, free: 517.4 MB)
// :: INFO SparkContext: Created broadcast from broadcast at DAGScheduler.scala:
// :: INFO DAGScheduler: Submitting missing tasks from ShuffleMapStage (MapPartitionsRDD[] at map at MobileLocation.scala:)
// :: INFO TaskSchedulerImpl: Adding task set 0.0 with tasks
// :: INFO Executor: Running task 0.0 in stage 1.0 (TID )
// :: INFO Executor: Fetching http://192.168.71.11:58135/jars/spark-1.0-SNAPSHOT.jar with timestamp 1485286086076
// :: INFO Utils: Fetching http://192.168.71.11:58135/jars/spark-1.0-SNAPSHOT.jar to /tmp/spark-eaa76419-9ddc-43c1-ad0e-cc95c0dede7e/userFiles-1790c4be-0a0d-45fd-91f9-c66c49e765a5/fetchFileTemp1508611941727652704.tmp
// :: INFO Executor: Adding file:/tmp/spark-eaa76419-9ddc-43c1-ad0e-cc95c0dede7e/userFiles-1790c4be-0a0d-45fd-91f9-c66c49e765a5/spark-1.0-SNAPSHOT.jar to class loader
// :: INFO HadoopRDD: Input split: hdfs://hadoop01:9000/mobile/input/loc_logs/loc_info.txt:0+171
// :: INFO Executor: Finished task 0.0 in stage 1.0 (TID ). bytes result sent to driver
// :: INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID , localhost, partition ,ANY, bytes)
// :: INFO Executor: Running task 0.0 in stage 0.0 (TID )
// :: INFO HadoopRDD: Input split: hdfs://hadoop01:9000/mobile/input/mobile_logs/19735E1C66.log:0+248
// :: INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID ) in ms on localhost (/)
// :: INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
// :: INFO DAGScheduler: ShuffleMapStage (map at MobileLocation.scala:) finished in 5.716 s
// :: INFO DAGScheduler: looking for newly runnable stages
// :: INFO DAGScheduler: running: Set(ShuffleMapStage )
// :: INFO DAGScheduler: waiting: Set(ShuffleMapStage , ShuffleMapStage , ResultStage )
// :: INFO DAGScheduler: failed: Set()
// :: INFO Executor: Finished task 0.0 in stage 0.0 (TID ). bytes result sent to driver
// :: INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID , localhost, partition ,ANY, bytes)
// :: INFO Executor: Running task 1.0 in stage 0.0 (TID )
// :: INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID ) in ms on localhost (/)
// :: INFO HadoopRDD: Input split: hdfs://hadoop01:9000/mobile/input/mobile_logs/DDE7970F68.log:0+496
// :: INFO Executor: Finished task 1.0 in stage 0.0 (TID ). bytes result sent to driver
// :: INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID , localhost, partition ,ANY, bytes)
// :: INFO Executor: Running task 2.0 in stage 0.0 (TID )
// :: INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID ) in ms on localhost (/)
// :: INFO HadoopRDD: Input split: hdfs://hadoop01:9000/mobile/input/mobile_logs/E549D940E0.log:0+496
// :: INFO Executor: Finished task 2.0 in stage 0.0 (TID ). bytes result sent to driver
// :: INFO DAGScheduler: ShuffleMapStage (map at MobileLocation.scala:) finished in 6.045 s
// :: INFO DAGScheduler: looking for newly runnable stages
// :: INFO DAGScheduler: running: Set()
// :: INFO DAGScheduler: waiting: Set(ShuffleMapStage , ShuffleMapStage , ResultStage )
// :: INFO DAGScheduler: failed: Set()
// :: INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID ) in ms on localhost (/)
// :: INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
// :: INFO DAGScheduler: Submitting ShuffleMapStage (MapPartitionsRDD[] at map at MobileLocation.scala:), which has no missing parents
// :: INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.9 KB, free 288.3 KB)
// :: INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1800.0 B, free 290.0 KB)
// :: INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost: (size: 1800.0 B, free: 517.4 MB)
// :: INFO SparkContext: Created broadcast from broadcast at DAGScheduler.scala:
// :: INFO DAGScheduler: Submitting missing tasks from ShuffleMapStage (MapPartitionsRDD[] at map at MobileLocation.scala:)
// :: INFO TaskSchedulerImpl: Adding task set 2.0 with tasks
// :: INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID , localhost, partition ,NODE_LOCAL, bytes)
// :: INFO Executor: Running task 0.0 in stage 2.0 (TID )
// :: INFO ShuffleBlockFetcherIterator: Getting non-empty blocks out of blocks
// :: INFO ShuffleBlockFetcherIterator: Started remote fetches in ms
// :: INFO Executor: Finished task 0.0 in stage 2.0 (TID ). bytes result sent to driver
// :: INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID , localhost, partition ,NODE_LOCAL, bytes)
// :: INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID ) in ms on localhost (/)
// :: INFO Executor: Running task 1.0 in stage 2.0 (TID )
// :: INFO ShuffleBlockFetcherIterator: Getting non-empty blocks out of blocks
// :: INFO ShuffleBlockFetcherIterator: Started remote fetches in ms
// :: INFO Executor: Finished task 1.0 in stage 2.0 (TID ). bytes result sent to driver
// :: INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID , localhost, partition ,NODE_LOCAL, bytes)
// :: INFO Executor: Running task 2.0 in stage 2.0 (TID )
// :: INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID ) in ms on localhost (/)
// :: INFO ShuffleBlockFetcherIterator: Getting non-empty blocks out of blocks
// :: INFO ShuffleBlockFetcherIterator: Started remote fetches in ms
// :: INFO Executor: Finished task 2.0 in stage 2.0 (TID ). bytes result sent to driver
// :: INFO DAGScheduler: ShuffleMapStage (map at MobileLocation.scala:) finished in 0.673 s
// :: INFO DAGScheduler: looking for newly runnable stages
// :: INFO DAGScheduler: running: Set()
// :: INFO DAGScheduler: waiting: Set(ShuffleMapStage , ResultStage )
// :: INFO DAGScheduler: failed: Set()
// :: INFO TaskSetManager: Finished task 2.0 in stage 2.0 (TID ) in ms on localhost (/)
// :: INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
// :: INFO DAGScheduler: Submitting ShuffleMapStage (MapPartitionsRDD[] at groupBy at MobileLocation.scala:), which has no missing parents
// :: INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 4.1 KB, free 294.2 KB)
// :: INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 2.1 KB, free 296.2 KB)
// :: INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost: (size: 2.1 KB, free: 517.4 MB)
// :: INFO SparkContext: Created broadcast from broadcast at DAGScheduler.scala:
// :: INFO DAGScheduler: Submitting missing tasks from ShuffleMapStage (MapPartitionsRDD[] at groupBy at MobileLocation.scala:)
// :: INFO TaskSchedulerImpl: Adding task set 3.0 with tasks
// :: INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID , localhost, partition ,PROCESS_LOCAL, bytes)
// :: INFO Executor: Running task 0.0 in stage 3.0 (TID )
// :: INFO ShuffleBlockFetcherIterator: Getting non-empty blocks out of blocks
// :: INFO ShuffleBlockFetcherIterator: Started remote fetches in ms
// :: INFO ShuffleBlockFetcherIterator: Getting non-empty blocks out of blocks
// :: INFO ShuffleBlockFetcherIterator: Started remote fetches in ms
// :: INFO Executor: Finished task 0.0 in stage 3.0 (TID ). bytes result sent to driver
// :: INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID , localhost, partition ,PROCESS_LOCAL, bytes)
// :: INFO Executor: Running task 1.0 in stage 3.0 (TID )
// :: INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID ) in ms on localhost (/)
// :: INFO ShuffleBlockFetcherIterator: Getting non-empty blocks out of blocks
// :: INFO ShuffleBlockFetcherIterator: Started remote fetches in ms
// :: INFO ShuffleBlockFetcherIterator: Getting non-empty blocks out of blocks
// :: INFO ShuffleBlockFetcherIterator: Started remote fetches in ms
// :: INFO Executor: Finished task 1.0 in stage 3.0 (TID ). bytes result sent to driver
// :: INFO TaskSetManager: Starting task 2.0 in stage 3.0 (TID , localhost, partition ,PROCESS_LOCAL, bytes)
// :: INFO Executor: Running task 2.0 in stage 3.0 (TID )
// :: INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID ) in ms on localhost (/)
// :: INFO ShuffleBlockFetcherIterator: Getting non-empty blocks out of blocks
// :: INFO ShuffleBlockFetcherIterator: Started remote fetches in ms
// :: INFO ShuffleBlockFetcherIterator: Getting non-empty blocks out of blocks
// :: INFO ShuffleBlockFetcherIterator: Started remote fetches in ms
// :: INFO Executor: Finished task 2.0 in stage 3.0 (TID ). bytes result sent to driver
// :: INFO DAGScheduler: ShuffleMapStage (groupBy at MobileLocation.scala:) finished in 0.606 s
// :: INFO DAGScheduler: looking for newly runnable stages
// :: INFO DAGScheduler: running: Set()
// :: INFO DAGScheduler: waiting: Set(ResultStage )
// :: INFO DAGScheduler: failed: Set()
// :: INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID ) in ms on localhost (/)
// :: INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
// :: INFO DAGScheduler: Submitting ResultStage (MapPartitionsRDD[] at saveAsTextFile at MobileLocation.scala:), which has no missing parents
// :: INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 66.4 KB, free 362.6 KB)
// :: INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 23.0 KB, free 385.6 KB)
// :: INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost: (size: 23.0 KB, free: 517.3 MB)
// :: INFO SparkContext: Created broadcast from broadcast at DAGScheduler.scala:
// :: INFO DAGScheduler: Submitting missing tasks from ResultStage (MapPartitionsRDD[] at saveAsTextFile at MobileLocation.scala:)
// :: INFO TaskSchedulerImpl: Adding task set 4.0 with tasks
// :: INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID , localhost, partition ,NODE_LOCAL, bytes)
// :: INFO Executor: Running task 0.0 in stage 4.0 (TID )
// :: INFO ShuffleBlockFetcherIterator: Getting non-empty blocks out of blocks
// :: INFO ShuffleBlockFetcherIterator: Started remote fetches in ms
// :: INFO FileOutputCommitter: Saved output of task 'attempt_201701241128_0004_m_000000_10' to hdfs://hadoop01:9000/mobile/output/_temporary/0/task_201701241128_0004_m_000000
// :: INFO SparkHadoopMapRedUtil: attempt_201701241128_0004_m_000000_10: Committed
// :: INFO Executor: Finished task 0.0 in stage 4.0 (TID ). bytes result sent to driver
// :: INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID , localhost, partition ,NODE_LOCAL, bytes)
// :: INFO Executor: Running task 1.0 in stage 4.0 (TID )
// :: INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID ) in ms on localhost (/)
// :: INFO ShuffleBlockFetcherIterator: Getting non-empty blocks out of blocks
// :: INFO ShuffleBlockFetcherIterator: Started remote fetches in ms
// :: INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost: in memory (size: 2.2 KB, free: 517.3 MB)
// :: INFO BlockManagerInfo: Removed broadcast_5_piece0 on localhost: in memory (size: 2.1 KB, free: 517.3 MB)
// :: INFO BlockManagerInfo: Removed broadcast_4_piece0 on localhost: in memory (size: 1800.0 B, free: 517.3 MB)
// :: INFO BlockManagerInfo: Removed broadcast_3_piece0 on localhost: in memory (size: 2.2 KB, free: 517.4 MB)
// :: INFO FileOutputCommitter: Saved output of task 'attempt_201701241128_0004_m_000001_11' to hdfs://hadoop01:9000/mobile/output/_temporary/0/task_201701241128_0004_m_000001
// :: INFO SparkHadoopMapRedUtil: attempt_201701241128_0004_m_000001_11: Committed
// :: INFO Executor: Finished task 1.0 in stage 4.0 (TID ). bytes result sent to driver
// :: INFO TaskSetManager: Starting task 2.0 in stage 4.0 (TID , localhost, partition ,NODE_LOCAL, bytes)
// :: INFO Executor: Running task 2.0 in stage 4.0 (TID )
// :: INFO TaskSetManager: Finished task 1.0 in stage 4.0 (TID ) in ms on localhost (/)
// :: INFO ShuffleBlockFetcherIterator: Getting non-empty blocks out of blocks
// :: INFO ShuffleBlockFetcherIterator: Started remote fetches in ms
// :: INFO FileOutputCommitter: Saved output of task 'attempt_201701241128_0004_m_000002_12' to hdfs://hadoop01:9000/mobile/output/_temporary/0/task_201701241128_0004_m_000002
// :: INFO SparkHadoopMapRedUtil: attempt_201701241128_0004_m_000002_12: Committed
// :: INFO Executor: Finished task 2.0 in stage 4.0 (TID ). bytes result sent to driver
// :: INFO DAGScheduler: ResultStage (saveAsTextFile at MobileLocation.scala:) finished in 3.750 s
// :: INFO TaskSetManager: Finished task 2.0 in stage 4.0 (TID ) in ms on localhost (/)
// :: INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
// :: INFO DAGScheduler: Job finished: saveAsTextFile at MobileLocation.scala:, took 14.029200 s
// :: INFO SparkUI: Stopped Spark web UI at http://192.168.71.11:4040
// :: INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
// :: INFO MemoryStore: MemoryStore cleared
// :: INFO BlockManager: BlockManager stopped
// :: INFO BlockManagerMaster: BlockManagerMaster stopped
// :: INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
// :: INFO SparkContext: Successfully stopped SparkContext
// :: INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
// :: INFO ShutdownHookManager: Shutdown hook called
// :: INFO ShutdownHookManager: Deleting directory /tmp/spark-eaa76419-9ddc-43c1-ad0e-cc95c0dede7e
// :: INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
// :: INFO ShutdownHookManager: Deleting directory /tmp/spark-eaa76419-9ddc-43c1-ad0e-cc95c0dede7e/httpd--8d9c-4ea7-9fca-5caf8c712f86
[root@hadoop01 ~]#
[root@hadoop01 ~]#
[root@hadoop01 ~]#
8. 查看输出结果
[root@hadoop01 ~]# hdfs dfs -cat /mobile/output/part-00000
[root@hadoop01 ~]# hdfs dfs -cat /mobile/output/part-00001
(18688888888,List((16030401EAFB68F1E3CDF819735E1C66,((18688888888,87600),(116.296302,40.032296))), (9F36407EAD0629FC166F14DDE7970F68,((18688888888,51200),(116.304864,40.050645)))))
(18611132889,List((16030401EAFB68F1E3CDF819735E1C66,((18611132889,97500),(116.296302,40.032296))), (9F36407EAD0629FC166F14DDE7970F68,((18611132889,54000),(116.304864,40.050645)))))
[root@hadoop01 ~]# hdfs dfs -cat /mobile/output/part-00002
[root@hadoop01 ~]#
使用Scala编写Spark程序求基站下移动用户停留时长TopN的更多相关文章
- 大数据学习day25------spark08-----1. 读取数据库的形式创建DataFrame 2. Parquet格式的数据源 3. Orc格式的数据源 4.spark_sql整合hive 5.在IDEA中编写spark程序(用来操作hive) 6. SQL风格和DSL风格以及RDD的形式计算连续登陆三天的用户
1. 读取数据库的形式创建DataFrame DataFrameFromJDBC object DataFrameFromJDBC { def main(args: Array[String]): U ...
- sbt打包Scala写的Spark程序,打包正常,提交运行时提示找不到对应的类
sbt打包Scala写的Spark程序,打包正常,提交运行时提示找不到对应的类 详述 使用sbt对写的Spark程序打包,过程中没有问题 spark-submit提交jar包运行提示找不到对应的类 解 ...
- pycharm编写spark程序,导入pyspark包
一种方法: File --> Default Setting --> 选中Project Interpreter中的一个python版本-->点击右边锯齿形图标(设置)-->选 ...
- indows Eclipse Scala编写WordCount程序
Windows Eclipse Scala编写WordCount程序: 1)无需启动hadoop,因为我们用的是本地文件.先像原来一样,做一个普通的scala项目和Scala Object. 但这里一 ...
- (转载)用VS2012或VS2013在win7下编写的程序在XP下运行就出现“不是有效的win32应用程序“
原文地址:http://www.vcerror.com/?p=1483 问题描述: 用VC2013编译了一个程序,在Windows 8.Windows 7(64位.32位)下都能正常运行.但在Win ...
- 已知一个正整数m,编写一个程序求m的反序数(待消化)
import java.util.Scanner; /** * @author:(LiberHome) * @date:Created in 2019/3/5 21:08 * @description ...
- 如何让VS2012编写的程序在XP下运行
Win32主程序需要以下设置 第一步:在工程属性General设置 第二步:在C/C++ Code Generation 设置 第三步:SubSystem 和 Minimum Required Ve ...
- idea配置scala编写spark wordcount程序
1.创建scala maven项目 选择骨架的时候为org.scala-tools.archetypes:scala-aechetype-simple 1.2 2.导入包,进入spark官网Docum ...
- Spark&Hadoop:scala编写spark任务jar包,运行无法识别main函数,怎么办?
昨晚和同事一起看一个scala写的程序,程序都写完了,且在idea上debug运行是ok的.但我们不能调试的方式部署在客户机器上,于是打包吧.打包时,我们是采用把外部引入的五个包(spark-asse ...
随机推荐
- java面试第三天
类和对象: 类:主观抽象,是对象的模板,可以实例化对象----具有相同属性和行为的对象的集合. 习惯上类的定义格式: package xxx; import xxx; public class Xxx ...
- Mysql 监视工具
对于开发人员来说,最头大的莫过于 ,你本地没事,线上 错误日志一堆. 尤其是数据库通信那一层.SqlServer 有 sql profile 用来监视对应的server上的通信日志,参数 命令等信息. ...
- 错误号:1364 错误信息:Field 'platId' doesn't have a default value
1. 错误描写叙述 错误号:1364 错误信息:Field 'platId' doesn't have a default value insert into `use`.`t_platform_sc ...
- HDUOJ-------Naive and Silly Muggles
Naive and Silly Muggles Time Limit: 2000/1000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/ ...
- Hibernate学习备忘
1.关于Hibernate异常: org.hibernate.service.jndi.JndiException: Error parsing JNDI name 刚接触Hibernate,调试 ...
- 使用SoapUI生成WS请求报文
WSDL地址示例:http://10.1.84.10:8100/webService/common/mail?wsdl 打开SoapUI,创建一个Project,输入wsdl地址就ok. 1.访问 ...
- http://www.cnblogs.com/txw1958/p/alipay-f2fpay.html
一.条码支付及二维码支付介绍 1. 条码支付 条码支付是支付宝给到线下传统行业的一种收款方式.商家使用扫码枪等条码识别设备扫描用户支付宝钱包上的条码/二维码,完成收款.用户仅需出示付款码,所有收款操作 ...
- VMWare安装Linux系统之CentOS-6.6操作方法。
1.使用VMWare创建新的虚拟主机 2.使用VMWare安装Linux,点击“开启虚拟主机” 3.进入Linux安装界面,选择第一项"Install or upgrade an exist ...
- 在 Asp.NET MVC 中使用 SignalR 实现推送功能 [转]
在 Asp.NET MVC 中使用 SignalR 实现推送功能 罗朝辉 ( http://blog.csdn.net/kesalin ) CC许可,转载请注明出处 一,简介 Signal 是微软支持 ...
- Jersey框架
我从别人博客那儿搬点东西过来,原博请看最下面~看的顺序反了..应该先看JAX-RS整体的东西再看具体实现的Jersey例子的= =无数次改这个日记了不能忍...所以决定把JAX-RS系列的文章搬过来. ...