Error Analysis

The stack trace contains this error message: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 264, idc-xx-xx-3-30.d.xx.com, executor 2): java.lang.OutOfMemoryError: Java heap space

From this message we can conclude the following:

  • A stage is made up of a set of tasks (a taskset); task 1 in stage 2.0 failed 4 times.
  • An executor is the process that actually runs tasks; executor 2 ran out of Java heap space.
  • The executor memory was set to 512 MB and spark.executor.memoryOverhead was not configured. Spark decides how much memory to request for each executor as follows:
  1. When spark.executor.memoryOverhead is not configured to control off-heap memory directly (off-heap memory holds serialized objects in a large region that the GC does not manage directly, deserialized on demand, much like spilling to disk; here "off-heap" covers the method area, direct memory, VM stacks, and native method stacks):

    realMem = executorMemory [heap] + max(executorMemory * 0.10, 384 MB) [off-heap]

    2. When spark.executor.memoryOverhead is configured:

    realMem = executorMemory[heap] + memoryOverhead[off-heap]

realMem is the total memory the JVM process requests from the cluster manager; if it exceeds the container's memory limit, the container is killed outright. A small worked example follows.
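As a rough illustration of the formula above (a minimal sketch, not Spark's actual source; the 0.10 factor and the 384 MB floor are the documented defaults), the arithmetic for this job's 512 MB executors works out as follows:

public class ExecutorMemoryEstimate {
    private static final long MIN_OVERHEAD_MB = 384;    // documented minimum overhead
    private static final double OVERHEAD_FACTOR = 0.10; // documented default factor

    // Total memory requested from the resource manager when
    // spark.executor.memoryOverhead is not set.
    static long requestedMb(long executorMemoryMb) {
        long overhead = Math.max((long) (executorMemoryMb * OVERHEAD_FACTOR), MIN_OVERHEAD_MB);
        return executorMemoryMb + overhead; // heap + off-heap overhead
    }

    public static void main(String[] args) {
        // For this job: 512 + max(51, 384) = 896 MB requested per executor.
        System.out.println(requestedMb(512) + " MB");
    }
}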

Exception types

  • OutOfMemoryError: Java heap space: the heap is exhausted; increase --executor-memory (a configuration sketch follows this list).
  • OutOfMemoryError: PermGen space: non-heap memory is exhausted; increase spark.executor.memoryOverhead.
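One way to apply those adjustments is sketched below. The concrete values (2g heap, 512m overhead) are placeholders, and because executor memory must be fixed before the SparkContext starts, these settings are more commonly passed at submit time as --executor-memory and --conf spark.executor.memoryOverhead=...:

import org.apache.spark.sql.SparkSession;

public class MemoryTuningExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("memory-tuning-example")
                .config("spark.executor.memory", "2g")           // heap; same as --executor-memory 2g
                .config("spark.executor.memoryOverhead", "512m") // off-heap overhead
                // On older Spark-on-YARN releases the key is spark.yarn.executor.memoryOverhead (in MB).
                .getOrCreate();

        spark.sql("SELECT 1").show();
        spark.stop();
    }
}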

The exception below is a Java heap space case, so --executor-memory is the setting to adjust.

Where RDD blocks are stored is decided by MemoryMode: they can live on-heap or off-heap.
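As a side note on MemoryMode, here is a minimal sketch of requesting off-heap storage for cached blocks. This is not what the failing job did; it assumes Spark 2.x, where OFF_HEAP storage also requires spark.memory.offHeap.enabled and spark.memory.offHeap.size to be set:

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class OffHeapCacheExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("off-heap-cache-example")
                .config("spark.memory.offHeap.enabled", "true") // required for MemoryMode.OFF_HEAP
                .config("spark.memory.offHeap.size", "256m")    // placeholder size
                .getOrCreate();

        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Blocks are stored with MemoryMode.OFF_HEAP instead of the default ON_HEAP.
        numbers.persist(StorageLevel.OFF_HEAP());
        System.out.println(numbers.count());

        spark.stop();
    }
}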

The exception as it appears in the driver log:

: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 264, idc-xx-xx-3-30.d.xx.com, executor 2): java.lang.OutOfMemoryError: Java heap space
at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:778)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:511)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:184)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:132)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1005)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:996)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:936)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:996)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:700)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
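Reading the Caused by section bottom-up: the task is materializing a cached Dataset (MemoryStore.putIteratorAsValues via InMemoryRelation), and while pulling the next batch from the vectorized Parquet reader it tries to load an entire row group (ParquetFileReader.readNextRowGroup → ConsecutiveChunkList.readAll, line 778) and cannot allocate the buffer. The outer trace shows the job was an INSERT issued from PySpark through py4j (SparkSession.sql → InsertIntoHadoopFsRelationCommand).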

The code that throws the exception (the top frame of the Caused by trace):

/**
 * @param f file to read the chunks from
 * @return the chunks
 * @throws IOException
 */
public List<Chunk> readAll(FSDataInputStream f) throws IOException {
  List<Chunk> result = new ArrayList<Chunk>(chunks.size());
  f.seek(offset);
  byte[] chunksBytes = new byte[length]; // line 778: allocating a byte[] of size length fails with Java heap space when the heap cannot hold it
  f.readFully(chunksBytes);
  // report in a counter the data we just scanned
  BenchmarkCounter.incrementBytesRead(length);
  int currentChunkOffset = 0;
  for (int i = 0; i < chunks.size(); i++) {
    ChunkDescriptor descriptor = chunks.get(i);
    if (i < chunks.size() - 1) {
      result.add(new Chunk(descriptor, chunksBytes, currentChunkOffset));
    } else {
      // because of a bug, the last chunk might be larger than descriptor.size
      result.add(new WorkaroundChunk(descriptor, chunksBytes, currentChunkOffset, f));
    }
    currentChunkOffset += descriptor.size;
  }
  return result;
}
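Note that length here is the combined size of the consecutive column chunks in the row group being read, so each active task needs enough free heap to hold at least one full row group of the input Parquet file, on top of Spark's own execution and storage memory. With only 512 MB of heap per executor, a single large row group is enough to trigger this OutOfMemoryError, which is why raising --executor-memory (or, where the input can be rewritten, using smaller Parquet row groups) resolves it.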
