Spark runtime metrics monitoring
The Spark UI on port 4040 already shows runtime metrics, so why collect our own and store them in Redis?
1. The monitoring page can be integrated into your own data platform to fit your business, which makes troubleshooting and email alerting easier.
2. You can add metric statistics of your own on top of what the Spark UI provides.
一、Spark's SparkListener
SparkListener is an abstract class that implements the SparkListenerInterface trait. To use it, define your own monitoring class that extends SparkListener and overrides the callbacks you need. The method names are self-explanatory and map one-to-one to the events they describe: to hook custom behaviour into a given phase, simply extend SparkListener and override the corresponding method. These callbacks expose the data volumes and timings of each phase of a running application, which is where our monitoring metrics come from.
abstract class SparkListener extends SparkListenerInterface {
  // called when a stage completes
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = { }
  // called when a stage is submitted
  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = { }
  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = { }
  override def onTaskGettingResult(taskGettingResult: SparkListenerTaskGettingResult): Unit = { }
  // called when a task ends
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = { }
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = { }
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = { }
  override def onEnvironmentUpdate(environmentUpdate: SparkListenerEnvironmentUpdate): Unit = { }
  override def onBlockManagerAdded(blockManagerAdded: SparkListenerBlockManagerAdded): Unit = { }
  override def onBlockManagerRemoved(blockManagerRemoved: SparkListenerBlockManagerRemoved): Unit = { }
  override def onUnpersistRDD(unpersistRDD: SparkListenerUnpersistRDD): Unit = { }
  override def onApplicationStart(applicationStart: SparkListenerApplicationStart): Unit = { }
  override def onApplicationEnd(applicationEnd: SparkListenerApplicationEnd): Unit = { }
  override def onExecutorMetricsUpdate(executorMetricsUpdate: SparkListenerExecutorMetricsUpdate): Unit = { }
  override def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit = { }
  override def onExecutorRemoved(executorRemoved: SparkListenerExecutorRemoved): Unit = { }
  override def onBlockUpdated(blockUpdated: SparkListenerBlockUpdated): Unit = { }
  override def onOtherEvent(event: SparkListenerEvent): Unit = { }
}
1. Implement your own SparkListener and store metrics in Redis from the onTaskEnd method
(1) Create a MySparkAppListener class that extends SparkListener and override onTaskEnd.
(2) The method: override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = { }
The SparkListenerTaskEnd class:
case class SparkListenerTaskEnd(
    // id of the stage the task belongs to
    stageId: Int,
    // attempt number of that stage (stages can be retried; this is not a child stage)
    stageAttemptId: Int,
    taskType: String,
    reason: TaskEndReason,
    // task info
    taskInfo: TaskInfo,
    // task metrics
    @Nullable taskMetrics: TaskMetrics)
  extends SparkListenerEvent
(3) Inside onTaskEnd, the following can be obtained from the taskInfo and taskMetrics members:
/**
 * 1. taskMetrics
 * 2. shuffle metrics
 * 3. task input / output
 * 4. taskInfo
 **/
(4) Monitoring information available from TaskMetrics
class TaskMetrics private[spark] () extends Serializable {
  // Each metric is internally represented as an accumulator
  private val _executorDeserializeTime = new LongAccumulator
  private val _executorDeserializeCpuTime = new LongAccumulator
  private val _executorRunTime = new LongAccumulator
  private val _executorCpuTime = new LongAccumulator
  private val _resultSize = new LongAccumulator
  private val _jvmGCTime = new LongAccumulator
  private val _resultSerializationTime = new LongAccumulator
  private val _memoryBytesSpilled = new LongAccumulator
  private val _diskBytesSpilled = new LongAccumulator
  private val _peakExecutionMemory = new LongAccumulator
  private val _updatedBlockStatuses = new CollectionAccumulator[(BlockId, BlockStatus)]

  val inputMetrics: InputMetrics = new InputMetrics()

  /**
   * Metrics related to writing data externally (e.g. to a distributed filesystem),
   * defined only in tasks with output.
   */
  val outputMetrics: OutputMetrics = new OutputMetrics()

  /**
   * Metrics related to shuffle read aggregated across all shuffle dependencies.
   * This is defined only if there are shuffle dependencies in this task.
   */
  val shuffleReadMetrics: ShuffleReadMetrics = new ShuffleReadMetrics()

  /**
   * Metrics related to shuffle write, defined only in shuffle map stages.
   */
  val shuffleWriteMetrics: ShuffleWriteMetrics = new ShuffleWriteMetrics()
  // ... (remainder of the class omitted)
}
(5) Implementing the listener and storing the metrics in Redis
/**
 * Requirement 1: capture the Spark job's runtime metrics, store them in Redis,
 * and surface them in our own back-end monitoring pages.
 */
// Imports assumed for this sketch; JedisUtil and jedisConfig.properties are the author's own
// helpers, and json4s (jackson backend) is assumed for the JSON serialization.
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd, TaskInfo}
import org.json4s.DefaultFormats
import org.json4s.jackson.Json
import redis.clients.jedis.Jedis

class MySparkAppListener extends SparkListener with Logging {

  val redisConf = "jedisConfig.properties"
  val jedis: Jedis = JedisUtil.getInstance().getJedis

  // first callback we override from the parent class
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // Information available inside onTaskEnd:
    /**
     * 1. taskMetrics
     * 2. shuffle metrics
     * 3. task input / output
     * 4. taskInfo
     **/
    val currentTimestamp = System.currentTimeMillis()

    // Metrics exposed by TaskMetrics:
    /**
     * private val _executorDeserializeTime = new LongAccumulator
     * private val _executorDeserializeCpuTime = new LongAccumulator
     * private val _executorRunTime = new LongAccumulator
     * private val _executorCpuTime = new LongAccumulator
     * private val _resultSize = new LongAccumulator
     * private val _jvmGCTime = new LongAccumulator
     * private val _resultSerializationTime = new LongAccumulator
     * private val _memoryBytesSpilled = new LongAccumulator
     * private val _diskBytesSpilled = new LongAccumulator
     * private val _peakExecutionMemory = new LongAccumulator
     * private val _updatedBlockStatuses = new CollectionAccumulator[(BlockId, BlockStatus)]
     */
    val metrics = taskEnd.taskMetrics
    val taskMetricsMap = scala.collection.mutable.HashMap(
      "executorDeserializeTime" -> metrics.executorDeserializeTime,       // executor deserialization time
      "executorDeserializeCpuTime" -> metrics.executorDeserializeCpuTime, // executor deserialization CPU time
      "executorRunTime" -> metrics.executorRunTime,                       // executor run time
      "resultSize" -> metrics.resultSize,                                 // result size
      "jvmGCTime" -> metrics.jvmGCTime,                                   // JVM GC time
      "resultSerializationTime" -> metrics.resultSerializationTime,       // result serialization time
      "memoryBytesSpilled" -> metrics.memoryBytesSpilled,                 // bytes spilled from memory
      "diskBytesSpilled" -> metrics.diskBytesSpilled,                     // bytes spilled to disk
      "peakExecutionMemory" -> metrics.peakExecutionMemory                // peak execution memory
    )
    val jedisKey = "taskMetrics_" + currentTimestamp
    // store the metrics map (the original wrote the key itself here, which was a bug)
    jedis.set(jedisKey, Json(DefaultFormats).write(taskMetricsMap))
    jedis.expire(jedisKey, 3600) // TTL in seconds (the original used pexpire, whose unit is milliseconds)

    // ====================== shuffle metrics ================================
    val shuffleReadMetrics = metrics.shuffleReadMetrics
    val shuffleWriteMetrics = metrics.shuffleWriteMetrics
    // shuffleWriteMetrics: metrics of the shuffle write path
    /**
     * private[executor] val _bytesWritten = new LongAccumulator
     * private[executor] val _recordsWritten = new LongAccumulator
     * private[executor] val _writeTime = new LongAccumulator
     */
    // shuffleReadMetrics: metrics of the shuffle read path
    /**
     * private[executor] val _remoteBlocksFetched = new LongAccumulator
     * private[executor] val _localBlocksFetched = new LongAccumulator
     * private[executor] val _remoteBytesRead = new LongAccumulator
     * private[executor] val _localBytesRead = new LongAccumulator
     * private[executor] val _fetchWaitTime = new LongAccumulator
     * private[executor] val _recordsRead = new LongAccumulator
     */
    val shuffleMap = scala.collection.mutable.HashMap(
      "remoteBlocksFetched" -> shuffleReadMetrics.remoteBlocksFetched, // shuffle blocks fetched from remote executors
      "localBlocksFetched" -> shuffleReadMetrics.localBlocksFetched,   // shuffle blocks fetched locally
      "remoteBytesRead" -> shuffleReadMetrics.remoteBytesRead,         // shuffle bytes read remotely
      "localBytesRead" -> shuffleReadMetrics.localBytesRead,           // shuffle bytes read locally
      "fetchWaitTime" -> shuffleReadMetrics.fetchWaitTime,             // time spent waiting for fetches
      "recordsRead" -> shuffleReadMetrics.recordsRead,                 // total shuffle records read
      "bytesWritten" -> shuffleWriteMetrics.bytesWritten,              // total shuffle bytes written
      "recordsWritten" -> shuffleWriteMetrics.recordsWritten,          // total shuffle records written
      "writeTime" -> shuffleWriteMetrics.writeTime
    )
    val shuffleKey = s"shuffleKey${currentTimestamp}"
    jedis.set(shuffleKey, Json(DefaultFormats).write(shuffleMap))
    jedis.expire(shuffleKey, 3600)

    // ================= input / output ========================
    val inputMetrics = taskEnd.taskMetrics.inputMetrics
    val outputMetrics = taskEnd.taskMetrics.outputMetrics
    val input_output = scala.collection.mutable.HashMap(
      "bytesRead" -> inputMetrics.bytesRead,            // bytes read
      "recordsRead" -> inputMetrics.recordsRead,        // records read
      "bytesWritten" -> outputMetrics.bytesWritten,     // bytes written
      "recordsWritten" -> outputMetrics.recordsWritten  // records written
    )
    val input_outputKey = s"input_outputKey${currentTimestamp}"
    jedis.set(input_outputKey, Json(DefaultFormats).write(input_output))
    jedis.expire(input_outputKey, 3600)

    // #################### taskInfo ####################
    val taskInfo: TaskInfo = taskEnd.taskInfo
    val taskInfoMap = scala.collection.mutable.HashMap(
      "taskId" -> taskInfo.taskId,
      "host" -> taskInfo.host,
      "speculative" -> taskInfo.speculative, // speculative execution
      "failed" -> taskInfo.failed,
      "killed" -> taskInfo.killed,
      "running" -> taskInfo.running
    )
    val taskInfoKey = s"taskInfo${currentTimestamp}"
    jedis.set(taskInfoKey, Json(DefaultFormats).write(taskInfoMap))
    jedis.expire(taskInfoKey, 3600)
  }
}
(6) Testing the listener
Register the monitoring class with sparkContext.addSparkListener:
sc.addSparkListener(new MySparkAppListener())
Run a simple word count job to verify it, as in the sketch below.
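A minimal sketch of such a test, assuming a local master and a placeholder input path; everything except the addSparkListener call is ordinary word-count code:

import org.apache.spark.{SparkConf, SparkContext}

object ListenerWordCountTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("listener-test").setMaster("local[2]") // assumed local run
    val sc = new SparkContext(conf)

    // register the custom listener before any job runs
    sc.addSparkListener(new MySparkAppListener())

    // a trivial word count; "data/words.txt" is a placeholder path
    sc.textFile("data/words.txt")
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)

    sc.stop() // onTaskEnd fires for every finished task and writes the metrics to Redis
  }
}

Alternatively, the listener can be registered through the spark.extraListeners configuration property instead of calling addSparkListener in code.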
二、Real-time monitoring for Spark Streaming
1. StreamingListener is the monitoring interface for Spark Streaming. It exposes callbacks for receiver start, error and stop events, and for batch submission, start and completion; the principle is the same as above.
trait StreamingListener {

  /** Called when a receiver has been started */
  def onReceiverStarted(receiverStarted: StreamingListenerReceiverStarted) { }

  /** Called when a receiver has reported an error */
  def onReceiverError(receiverError: StreamingListenerReceiverError) { }

  /** Called when a receiver has been stopped */
  def onReceiverStopped(receiverStopped: StreamingListenerReceiverStopped) { }

  /** Called when a batch of jobs has been submitted for processing. */
  def onBatchSubmitted(batchSubmitted: StreamingListenerBatchSubmitted) { }

  /** Called when processing of a batch of jobs has started. */
  def onBatchStarted(batchStarted: StreamingListenerBatchStarted) { }

  /** Called when processing of a batch of jobs has completed. */
  def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) { }

  /** Called when processing of a job of a batch has started. */
  def onOutputOperationStarted(
      outputOperationStarted: StreamingListenerOutputOperationStarted) { }

  /** Called when processing of a job of a batch has completed. */
  def onOutputOperationCompleted(
      outputOperationCompleted: StreamingListenerOutputOperationCompleted) { }
}
2. Main callbacks and their uses
1. onReceiverError
Monitor receiver error information and send email alerts.
2. onBatchCompleted — called when a batch of jobs has completed
(1) Offset handling in Spark Streaming: persist the offsets only after the batch has finished. (This still cannot guarantee exactly-once delivery — the program may be interrupted after the results are written but before the offsets are committed.)
(2) If the batch processing time exceeds the configured batch interval, the job is falling behind; send an email alert. A sketch combining both ideas follows.
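A minimal sketch of such a StreamingListener; the batch interval parameter, the sendAlertMail helper and the offset-saving hook are assumptions for illustration, not Spark APIs:

import org.apache.spark.streaming.scheduler._

class MyStreamingListener(batchIntervalMs: Long) extends StreamingListener {

  // hypothetical alerting helper; replace with your own mail / IM notification
  private def sendAlertMail(subject: String, body: String): Unit =
    println(s"[ALERT] $subject - $body")

  override def onReceiverError(receiverError: StreamingListenerReceiverError): Unit = {
    val info = receiverError.receiverInfo
    sendAlertMail("receiver error",
      s"receiver ${info.name} on ${info.location}: ${info.lastErrorMessage}")
  }

  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val batchInfo = batchCompleted.batchInfo

    // (1) this is where the application would persist its offsets
    // saveOffsets(batchInfo.batchTime)  // hypothetical, application-specific

    // (2) alert when the processing time of the batch exceeds the batch interval
    val processingTimeMs = batchInfo.processingDelay.getOrElse(0L)
    if (processingTimeMs > batchIntervalMs) {
      sendAlertMail("batch delayed",
        s"batch ${batchInfo.batchTime} took $processingTimeMs ms, interval is $batchIntervalMs ms")
    }
  }
}

Register it with ssc.addStreamingListener(new MyStreamingListener(batchIntervalMs)) before starting the StreamingContext.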
三、Parsing the Spark / YARN web endpoints for metric data
1. Start a Spark program locally
Visit http://localhost:4040/metrics/json/ to get a JSON payload; parsing its gauges object yields all of the information, for example:
{
"version": "3.0.0",
"gauges": {
"local-1581865176069.driver.BlockManager.disk.diskSpaceUsed_MB": {
"value": 0
},
"local-1581865176069.driver.BlockManager.memory.maxMem_MB": {
"value": 1989
},
"local-1581865176069.driver.BlockManager.memory.memUsed_MB": {
"value": 0
},
"local-1581865176069.driver.BlockManager.memory.remainingMem_MB": {
"value": 1989
},
"local-1581865176069.driver.DAGScheduler.job.activeJobs": {
"value": 0
},
"local-1581865176069.driver.DAGScheduler.job.allJobs": {
"value": 0
},
"local-1581865176069.driver.DAGScheduler.stage.failedStages": {
"value": 0
},
"local-1581865176069.driver.DAGScheduler.stage.runningStages": {
"value": 0
},
"local-1581865176069.driver.DAGScheduler.stage.waitingStages": {
"value": 0
}
},
"counters": {
"local-1581865176069.driver.HiveExternalCatalog.fileCacheHits": {
"count": 0
},
"local-1581865176069.driver.HiveExternalCatalog.filesDiscovered": {
"count": 0
},
"local-1581865176069.driver.HiveExternalCatalog.hiveClientCalls": {
"count": 0
},
"local-1581865176069.driver.HiveExternalCatalog.parallelListingJobCount": {
"count": 0
},
"local-1581865176069.driver.HiveExternalCatalog.partitionsFetched": {
"count": 0
}
},
"histograms": {
"local-1581865176069.driver.CodeGenerator.compilationTime": {
"count": 0,
"max": 0,
"mean": 0,
"min": 0,
"p50": 0,
"p75": 0,
"p95": 0,
"p98": 0,
"p99": 0,
"p999": 0,
"stddev": 0
},
"local-1581865176069.driver.CodeGenerator.generatedClassSize": {
"count": 0,
"max": 0,
"mean": 0,
"min": 0,
"p50": 0,
"p75": 0,
"p95": 0,
"p98": 0,
"p99": 0,
"p999": 0,
"stddev": 0
},
"local-1581865176069.driver.CodeGenerator.generatedMethodSize": {
"count": 0,
"max": 0,
"mean": 0,
"min": 0,
"p50": 0,
"p75": 0,
"p95": 0,
"p98": 0,
"p99": 0,
"p999": 0,
"stddev": 0
},
"local-1581865176069.driver.CodeGenerator.sourceCodeSize": {
"count": 0,
"max": 0,
"mean": 0,
"min": 0,
"p50": 0,
"p75": 0,
"p95": 0,
"p98": 0,
"p99": 0,
"p999": 0,
"stddev": 0
}
},
"meters": { },
"timers": {
"local-1581865176069.driver.DAGScheduler.messageProcessingTime": {
"count": 0,
"max": 0,
"mean": 0,
"min": 0,
"p50": 0,
"p75": 0,
"p95": 0,
"p98": 0,
"p99": 0,
"p999": 0,
"stddev": 0,
"m15_rate": 0,
"m1_rate": 0,
"m5_rate": 0,
"mean_rate": 0,
"duration_units": "milliseconds",
"rate_units": "calls/second"
}
}
}
Parse the JSON to extract the metric values. A sketch of obtaining the gauges object is shown next, followed by the field extraction.
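A minimal sketch of how gauges could be obtained, assuming the fastjson library (com.alibaba.fastjson) for parsing — which matches the getJSONObject/getLong calls below — with placeholder URL and applicationId:

import com.alibaba.fastjson.{JSON, JSONObject}
import scala.io.Source

// placeholder values; in local mode the applicationId looks like "local-1581865176069"
val applicationId = "local-1581865176069"
val url = "http://localhost:4040/metrics/json/"

// fetch the metrics endpoint and take the "gauges" object
val body: String = Source.fromURL(url, "UTF-8").mkString
val gauges: JSONObject = JSON.parseObject(body).getJSONObject("gauges")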
val diskSpaceUsed_MB = gauges.getJSONObject(applicationId + ".driver.BlockManager.disk.diskSpaceUsed_MB").getLong("value") // disk space used (MB)
val maxMem_MB = gauges.getJSONObject(applicationId + ".driver.BlockManager.memory.maxMem_MB").getLong("value") // maximum memory available to the BlockManager (MB)
val memUsed_MB = gauges.getJSONObject(applicationId + ".driver.BlockManager.memory.memUsed_MB").getLong("value") // memory in use (MB)
val remainingMem_MB = gauges.getJSONObject(applicationId + ".driver.BlockManager.memory.remainingMem_MB").getLong("value") // remaining free memory (MB)
//##################### jobs / stages ###################################
val activeJobs = gauges.getJSONObject(applicationId + ".driver.DAGScheduler.job.activeJobs").getLong("value") // currently running jobs
val allJobs = gauges.getJSONObject(applicationId + ".driver.DAGScheduler.job.allJobs").getLong("value") // total number of jobs
val failedStages = gauges.getJSONObject(applicationId + ".driver.DAGScheduler.stage.failedStages").getLong("value") // number of failed stages
val runningStages = gauges.getJSONObject(applicationId + ".driver.DAGScheduler.stage.runningStages").getLong("value") // running stages
val waitingStages = gauges.getJSONObject(applicationId + ".driver.DAGScheduler.stage.waitingStages").getLong("value") // stages waiting to run
//##################### StreamingMetrics ###################################
val lastCompletedBatch_processingDelay = gauges.getJSONObject(applicationId + ".driver.query.StreamingMetrics.streaming.lastCompletedBatch_processingDelay").getLong("value") // processing delay of the last completed batch
val lastCompletedBatch_processingEndTime = gauges.getJSONObject(applicationId + ".driver.query.StreamingMetrics.streaming.lastCompletedBatch_processingEndTime").getLong("value") // processing end time of the last completed batch (ms)
val lastCompletedBatch_processingStartTime = gauges.getJSONObject(applicationId + ".driver.query.StreamingMetrics.streaming.lastCompletedBatch_processingStartTime").getLong("value") // processing start time of the last completed batch
// processing time of the last completed batch
val lastCompletedBatch_processingTime = (lastCompletedBatch_processingEndTime - lastCompletedBatch_processingStartTime)
val lastReceivedBatch_records = gauges.getJSONObject(applicationId + ".driver.query.StreamingMetrics.streaming.lastReceivedBatch_records").getLong("value") // records in the last received batch
val runningBatches = gauges.getJSONObject(applicationId + ".driver.query.StreamingMetrics.streaming.runningBatches").getLong("value") // batches currently running
val totalCompletedBatches = gauges.getJSONObject(applicationId + ".driver.query.StreamingMetrics.streaming.totalCompletedBatches").getLong("value") // total number of completed batches
val totalProcessedRecords = gauges.getJSONObject(applicationId + ".driver.query.StreamingMetrics.streaming.totalProcessedRecords").getLong("value") // total records processed
val totalReceivedRecords = gauges.getJSONObject(applicationId + ".driver.query.StreamingMetrics.streaming.totalReceivedRecords").getLong("value") // total records received
val unprocessedBatches = gauges.getJSONObject(applicationId + ".driver.query.StreamingMetrics.streaming.unprocessedBatches").getLong("value") // batches not yet processed
val waitingBatches = gauges.getJSONObject(applicationId + ".driver.query.StreamingMetrics.streaming.waitingBatches").getLong("value") // batches waiting to be processed
2. When the Spark application is submitted to YARN
val sparkDriverHost = sc.getConf.get("spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES")
// the monitoring page path is the cluster proxy address + /proxy/ + application id + /metrics/json
val url = s"${sparkDriverHost}/metrics/json"
3. What the data can be used for
1. Per-job records (endTime, applicationUniqueName, applicationId, sourceCount, costTime, countPerMillis) can be shown as a table for end-to-end pipeline statistics.
2. Disk and memory figures can be shown as pie charts to monitor memory and disk usage.
3. Task run information can be shown as a table to monitor jobs.