Error analysis

The stack trace contains the following error message: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 264, idc-xx-xx-3-30.d.xx.com, executor 2): java.lang.OutOfMemoryError: Java heap space

From this message we can conclude the following:

  • A stage is made up of a set of tasks (a taskset); task 1 in stage 2.0 failed 4 times.
  • An executor is the process that actually runs tasks; executor 2 ran out of Java heap space.
  • Executor memory was configured at 512 MB and spark.executor.memoryOverhead was not set. Spark determines how much memory to request for each executor as follows:
  1) When spark.executor.memoryOverhead is not configured to control the off-heap portion directly (off-heap memory holds objects serialized into a large region not managed directly by the GC and deserialized when needed, much like spilling to disk; here "off-heap" also covers the method area, direct memory, JVM stacks, and native method stacks):

    realMem = executorMemory [heap] + max(executorMemory * 0.10, 384 MB) [off-heap overhead]

  2) When spark.executor.memoryOverhead is configured:

    realMem = executorMemory [heap] + memoryOverhead [off-heap overhead]

realMem is the total amount of memory the Java process requests; if it exceeds the container's memory limit, the container is killed outright.
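As a quick sanity check, the calculation can be sketched in a few lines of Python; the 10% factor and 384 MB floor are the usual Spark-on-YARN defaults and may differ slightly across versions:

# Sketch of how the executor container size is derived; the 0.10 factor and
# 384 MB floor are assumed defaults and may vary across Spark versions.
OVERHEAD_FACTOR = 0.10
MIN_OVERHEAD_MB = 384

def container_request_mb(executor_memory_mb, memory_overhead_mb=None):
    """Total memory the executor process asks its container for, in MB."""
    if memory_overhead_mb is None:
        # spark.executor.memoryOverhead not set: derive the overhead from the heap size
        memory_overhead_mb = max(int(executor_memory_mb * OVERHEAD_FACTOR), MIN_OVERHEAD_MB)
    return executor_memory_mb + memory_overhead_mb

print(container_request_mb(512))        # 512 + 384 = 896  (case 1)
print(container_request_mb(512, 1024))  # 512 + 1024 = 1536 (case 2)

With the 512 MB heap used by this job and no explicit overhead, the container is sized at about 896 MB, but task data still has to fit in the 512 MB heap, which is what overflows below.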

Exception types

  • OutOfMemoryError: Java heap space indicates the on-heap memory is exhausted; increase --executor-memory.
  • OutOfMemoryError: PermGen space indicates the off-heap memory is exhausted; increase spark.executor.memoryOverhead.

The exception below falls into the Java heap space category, so --executor-memory should be increased.
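The stack trace shows the job is driven through py4j, i.e. PySpark, so one way to raise the heap is in the session builder (equivalent to passing --executor-memory to spark-submit). This is only an illustrative sketch: the 2g and 1g values are not taken from the original job, and on older Spark-on-YARN releases the overhead key is spark.yarn.executor.memoryOverhead.

# Illustrative PySpark configuration; 2g / 1g are example values, not the
# settings used by the original job.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-read-job")                    # hypothetical app name
    .config("spark.executor.memory", "2g")          # heap: same as --executor-memory
    .config("spark.executor.memoryOverhead", "1g")  # off-heap overhead, if also needed
    .getOrCreate()
)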

Whether an RDD block is stored on-heap or off-heap is determined by its MemoryMode.
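For completeness, a cached block only ends up off-heap when off-heap storage is enabled and an off-heap storage level is requested; a minimal sketch, with an assumed 1g off-heap size:

# Sketch: storing cached data off-heap (MemoryMode.OFF_HEAP) instead of on-heap.
# The 1g off-heap size is an assumed value for illustration.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "1g")
    .getOrCreate()
)

df = spark.range(1000)
df.persist(StorageLevel.OFF_HEAP)  # cached blocks go to the off-heap region
df.count()                         # materialize the cache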

Exception seen in the log

: org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  at py4j.Gateway.invoke(Gateway.java:280)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.GatewayConnection.run(GatewayConnection.java:214)
  at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 264, idc-xx-xx-3-30.d.xx.com, executor 2): java.lang.OutOfMemoryError: Java heap space
  at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:778)
  at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:511)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
  at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:184)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
  at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:132)
  at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
  at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1005)
  at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:996)
  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:936)
  at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:996)
  at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:700)
  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

The code that triggered the exception

/**
 * @param f file to read the chunks from
 * @return the chunks
 * @throws IOException
 */
public List<Chunk> readAll(FSDataInputStream f) throws IOException {
  List<Chunk> result = new ArrayList<Chunk>(chunks.size());
  f.seek(offset);
  // Line 778: allocating a byte[] of size `length` fails with "Java heap space"
  // when the remaining heap cannot hold the whole consecutive chunk list.
  byte[] chunksBytes = new byte[length];
  f.readFully(chunksBytes);
  // report in a counter the data we just scanned
  BenchmarkCounter.incrementBytesRead(length);
  int currentChunkOffset = 0;
  for (int i = 0; i < chunks.size(); i++) {
    ChunkDescriptor descriptor = chunks.get(i);
    if (i < chunks.size() - 1) {
      result.add(new Chunk(descriptor, chunksBytes, currentChunkOffset));
    } else {
      // because of a bug, the last chunk might be larger than descriptor.size
      result.add(new WorkaroundChunk(descriptor, chunksBytes, currentChunkOffset, f));
    }
    currentChunkOffset += descriptor.size;
  }
  return result;
}
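The byte[] allocated at line 778 has to hold a run of consecutive column chunks from one row group, so its size is bounded by the row group's compressed size on disk. As a rough diagnostic, the row-group sizes of the input files can be inspected with pyarrow; this sketch is an assumption (pyarrow and the part-00001.parquet path are not part of the original analysis):

# Sketch: estimate how large the byte[] in readAll() may need to be, by summing
# the compressed column-chunk sizes per row group. The file path is hypothetical.
import pyarrow.parquet as pq

meta = pq.ParquetFile("part-00001.parquet").metadata
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    compressed = sum(rg.column(c).total_compressed_size for c in range(rg.num_columns))
    print(f"row group {i}: ~{compressed / (1024 * 1024):.1f} MB compressed")

If a single row group comes anywhere near the 512 MB executor heap, the allocation cannot succeed; either the heap has to grow or the data has to be written with smaller row groups.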
