【原创】大叔问题定位分享(7)Spark任务中Job进度卡住不动
Spark2.1.1
最近运行spark任务时会发现任务经常运行很久,具体job如下:
Stages: Succeeded/Total |
Tasks (for all stages): Succeeded/Total |
||||
16 |
2018/12/03 12:39:50 |
2.3 h |
0/5 |
196/4723 |
job中正在运行的stage如下:
Tasks: Succeeded/Total |
||||||||
60 |
2018/12/03 12:39:57 |
2.3 h |
196/200 |
4.5 GB |
1455.1 MB |
该stage中有4个task一直处于running状态,这些task的统计信息异常(Input Size / Records和Shuffle Write Size / Records均为0.0B/0),并且这4个task都位于同一个executor上:
33 |
8938 |
0 |
RUNNING |
PROCESS_LOCAL |
12 / $executor_server_ip |
2018/12/03 12:39:57 |
2.3 h |
0.0 B / 0 |
0.0 B / 0 |
有问题的task所在的executor统计信息也有异常(Total Tasks为0),该executor如下:
12 |
$executor_server_ip:36755 |
0 ms |
0 |
0 |
0 |
0 |
0.0 B / 0 |
0.0 B / 0 |
此时Driver堆栈信息如下:
"Driver" #26 prio=5 os_prio=0 tid=0x00007f163a116000 nid=0x5192 waiting on condition [0x00007f15bb9a0000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000001a8c4f9e0> (a scala.concurrent.impl.Promise$CompletionLatch)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:619)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1988)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1026)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1008)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1128)
at org.apache.spark.rdd.RDD$$anonfun$treeReduce$1.apply(RDD.scala:1059)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.treeReduce(RDD.scala:1037)
at breeze.optimize.CachedDiffFunction.calculate(CachedDiffFunction.scala:23)
at breeze.optimize.LineSearch$$anon$1.calculate(LineSearch.scala:41)
at breeze.optimize.LineSearch$$anon$1.calculate(LineSearch.scala:30)
at breeze.optimize.StrongWolfeLineSearch.breeze$optimize$StrongWolfeLineSearch$$phi$1(StrongWolfe.scala:69)
at breeze.optimize.StrongWolfeLineSearch$$anonfun$minimize$1.apply$mcVI$sp(StrongWolfe.scala:142)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at breeze.optimize.StrongWolfeLineSearch.minimize(StrongWolfe.scala:141)
at breeze.optimize.LBFGS.determineStepSize(LBFGS.scala:78)
at breeze.optimize.LBFGS.determineStepSize(LBFGS.scala:40)
at breeze.optimize.FirstOrderMinimizer$$anonfun$infiniteIterations$1.apply(FirstOrderMinimizer.scala:64)
at breeze.optimize.FirstOrderMinimizer$$anonfun$infiniteIterations$1.apply(FirstOrderMinimizer.scala:62)
at scala.collection.Iterator$$anon$7.next(Iterator.scala:129)
at breeze.util.IteratorImplicits$RichIterator$$anon$2.next(Implicits.scala:71)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at app.package.AppClass.main(AppClass.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
可见正在runJob,并且等待executor执行结果;
有问题的executor上堆栈信息有一个可疑的thread长时间一直在running:
"shuffle-client-5-4" #94 daemon prio=5 os_prio=0 tid=0x00007fbae0e42800 nid=0x2a3a runnable [0x00007fbae4760000]
java.lang.Thread.State: RUNNABLE
at io.netty.util.Recycler$Stack.scavengeSome(Recycler.java:476)
at io.netty.util.Recycler$Stack.scavenge(Recycler.java:454)
at io.netty.util.Recycler$Stack.pop(Recycler.java:435)
at io.netty.util.Recycler.get(Recycler.java:144)
at io.netty.buffer.PooledUnsafeDirectByteBuf.newInstance(PooledUnsafeDirectByteBuf.java:39)
at io.netty.buffer.PoolArena$DirectArena.newByteBuf(PoolArena.java:727)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:140)
at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
ps:出问题的executor上当时的内存资源很空闲,进程状态也正常:
-bash-4.2$ free -m
total used free shared buff/cache available
Mem: 257676 29251 5274 517 223150 226669
Swap: 0 0 0
怀疑此处可能有死循环,spark2.1.1使用的netty版本是4.0.42,查看netty代码:
io.netty.util.Recycler
boolean scavengeSome() { WeakOrderQueue cursor = this.cursor; if (cursor == null) { cursor = head; if (cursor == null) { return false; } } boolean success = false; WeakOrderQueue prev = this.prev; do { if (cursor.transfer(this)) { success = true; break; } WeakOrderQueue next = cursor.next; if (cursor.owner.get() == null) { // If the thread associated with the queue is gone, unlink it, after // performing a volatile read to confirm there is no data left to collect. // We never unlink the first queue, as we don't want to synchronize on updating the head. if (cursor.hasFinalData()) { for (;;) { if (cursor.transfer(this)) { success = true; } else { break; } } } if (prev != null) { prev.next = next; } } else { prev = cursor; } cursor = next; } while (cursor != null && !success); this.prev = prev; this.cursor = cursor; return success; }
问题在于cursor初始化的时候没有清空prev:
if (cursor == null) {
cursor = head;
该问题在4.0.43中被修复,升级spark2.1.1中的netty到4.0.43或以上版本可以修复问题;
官方issues位于:https://github.com/netty/netty/issues/6153
【原创】大叔问题定位分享(7)Spark任务中Job进度卡住不动的更多相关文章
- 【原创】大叔问题定位分享(18)beeline连接spark thrift有时会卡住
spark 2.1.1 beeline连接spark thrift之后,执行use database有时会卡住,而use database 在server端对应的是 setCurrentDatabas ...
- 【原创】大叔问题定位分享(10)提交spark任务偶尔报错 org.apache.spark.SparkException: A master URL must be set in your configuration
spark 2.1.1 一 问题重现 问题代码示例 object MethodPositionTest { val sparkConf = new SparkConf().setAppName(&qu ...
- 【原创】大叔问题定位分享(27)spark中rdd.cache
spark 2.1.1 spark应用中有一些task非常慢,持续10个小时,有一个task日志如下: 2019-01-24 21:38:56,024 [dispatcher-event-loop-2 ...
- 【原创】大叔问题定位分享(21)spark执行insert overwrite非常慢,比hive还要慢
最近把一些sql执行从hive改到spark,发现执行更慢,sql主要是一些insert overwrite操作,从执行计划看到,用到InsertIntoHiveTable spark-sql> ...
- 【原创】大叔问题定位分享(19)spark task在executors上分布不均
最近提交一个spark应用之后发现执行非常慢,点开spark web ui之后发现卡在一个job的一个stage上,这个stage有100000个task,但是绝大部分task都分配到两个execut ...
- 【原创】大叔问题定位分享(17)spark查orc格式数据偶尔报错NullPointerException
spark查orc格式的数据有时会报这个错 Caused by: java.lang.NullPointerException at org.apache.hadoop.hive.ql.io.orc. ...
- 【原创】大叔问题定位分享(16)spark写数据到hive外部表报错ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat
spark 2.1.1 spark在写数据到hive外部表(底层数据在hbase中)时会报错 Caused by: java.lang.ClassCastException: org.apache.h ...
- 【原创】大叔问题定位分享(15)spark写parquet数据报错ParquetEncodingException: empty fields are illegal, the field should be ommited completely instead
spark 2.1.1 spark里执行sql报错 insert overwrite table test_parquet_table select * from dummy 报错如下: org.ap ...
- 【原创】大叔问题定位分享(12)Spark保存文本类型文件(text、csv、json等)到hdfs时为什么是压缩格式的
问题重现 rdd.repartition(1).write.csv(outPath) 写文件之后发现文件是压缩过的 write时首先会获取hadoopConf,然后从中获取是否压缩以及压缩格式 org ...
随机推荐
- 基于 HTML5 WebGL 的 3D 棉花加工监控系统
前言 现在的棉花加工行业还停留在传统的反应式维护模式当中,当棉花加下厂的设备突然出现故障时,控制程序需要更换.这种情况下,首先需要客户向设备生产厂家请求派出技术人员进行维护,然后生产厂家才能根据情况再 ...
- go语言-值类型与引用类型
https://www.cnblogs.com/java-zhao/p/9942311.html https://blog.csdn.net/TDCQZD/article/details/826836 ...
- php中一些容易混淆的函数总结
在我们日常PHP开发中,经常会使用一些函数完成相关操作,但是有些函数功能相近,很容易混淆,再次总结一下 1. __DIR__ && getcwd() 看官方解释: getcwd: ...
- 「【算法进阶0x30】数学知识A」作业简洁总结
t1-Prime Distance 素数距离 大范围筛素数. t2-阶乘分解 欧拉筛素数后,按照蓝皮上的式子筛出素数. 复杂度:O(nlogn) t3-反素数ant 搜索 t4-余数之和 整除分块+容 ...
- luogu4705玩游戏
题解 我们要对于每个t,求一个(1/mn)sigma(ax+by)^t. 把系数不用管,把其他部分二项式展开一下: simga(ax^r*by^(t-r)*C(t,r)). 把组合数拆开,就变成了一个 ...
- Yii2.0 安装使用报错:yii\web\Request::cookieValidationKey must be configured with a secret key.
下载了Yii2.0的basic版,配置好apache之后,浏览器访问,出现如下错误: Invalid Configuration – yii\base\InvalidConfigException y ...
- idea搭建springboot
1.创建新项目 2.继续项目配置 Name:项目名称Type:我们是Maven构建的,那么选择第一个Maven ProjectPackaging:打包类型,打包成Jar文件Java Version: ...
- java.io.FileNotFoundException:my-release-key.keyStore拒绝访问
安卓生成APK的时候,生成密钥的时候报java.io.FileNotFoundException:my-release-key.keyStore拒绝访问的错误 这是因为权限问题:你的jdk目录在c盘, ...
- ansible基本使用方法
一.ansible的运行流程 ansible是基于ssh模块的软件,所以主控端和被控端的ssh服务必须正常才能保证ansbile软件的可用性. 检查ssh服务是否正常: systemctl sta ...
- 安装python caffe过程中遇到的一些问题以及对应的解决方案
关于系统环境: Ubuntu 16.04 LTS cuda 8.0 cudnn 6.5 Anaconda3 编译pycaffe之前需要配置文件Makefile.config ## Refer to h ...