问题复现:

G:\bigdata\spark-2.3.3-bin-hadoop2.7\bin>spark-shell
2020-12-26 10:20:48 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://DESKTOP-01KN1P4:4040
Spark context available as 'sc' (master = local[*], app id = local-1608949256544).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.3
/_/ Using Scala version 2.11.8 (Java HotSpot(TM) Client VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information. scala> sql("create table empty_orc(a int) stored as orc location '/tmp/empty_orc'").show
++
||
++
++ (其他窗口新建一个空文件) touch /tmp/empty_orc/zero.orc scala> sql("select * from empty_orc").show java.lang.RuntimeException: serious problem
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:340)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3278)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3259)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3258)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2489)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2703)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
... 49 elided
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
... 99 more

该问题的主要原因是在读取orc表时,遇到有空文件时报错,bug记录地址:

SPARK-19809:NullPointerException on zero-size ORC file(https://issues.apache.org/jira/browse/SPARK-19809)

SPARK-29773:Unable to process empty ORC files in Hive Table using Spark SQL(https://issues.apache.org/jira/browse/SPARK-29773)

解决办法:使用参数spark.sql.hive.convertMetastoreOrc=true

G:\bigdata\spark-2.3.3-bin-hadoop2.7\bin>spark-shell --conf spark.sql.hive.convertMetastoreOrc=true
2020-12-26 10:29:06 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://DESKTOP-01KN1P4:4040
Spark context available as 'sc' (master = local[*], app id = local-1608949754291).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.3
/_/ Using Scala version 2.11.8 (Java HotSpot(TM) Client VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information. scala> sql("select * from empty_orc").show +---+
| a|
+---+
+---+

spark的帮助文档种介绍如下:

ORC Files

Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added. The vectorized reader is used for the native ORC tables (e.g., the ones created using the clause USING ORC) when spark.sql.orc.impl is set to native and spark.sql.orc.enableVectorizedReader is set to true. For the Hive ORC serde tables (e.g., the ones created using the clause USING HIVE OPTIONS (fileFormat 'ORC')), the vectorized reader is used when spark.sql.hive.convertMetastoreOrc is also set to true.

https://spark.apache.org/docs/2.3.3/sql-programming-guide.html#orc-files

spark读取空orc文件时报错java.lang.RuntimeException: serious problem at OrcInputFormat.generateSplitsInfo的更多相关文章

  1. sparksql读取hive数据报错:java.lang.RuntimeException: serious problem

    问题: Caused by: java.util.concurrent.ExecutionException: java.lang.IndexOutOfBoundsException: Index: ...

  2. shiro使用redis作为缓存,出现要清除缓存时报错 java.lang.Exception: Failed to deserialize at org.crazycake.shiro.SerializeUtils.deserialize(SerializeUtils.java:41) ~[shiro-redis-2.4.2.1-RELEASE.jar:na]

    shiro使用redis作为缓存,出现要清除缓存时报错 java.lang.Exception: Failed to deserialize at org.crazycake.shiro.Serial ...

  3. 使用RestTemplate时报错java.lang.IllegalStateException: No instances available for 127.0.0.1

    我在RestTemplate的配置类里使用了 @LoadBalanced@Componentpublic class RestTemplateConfig { @Bean @LoadBalanced ...

  4. 云笔记项目- 上传文件报错"java.lang.IllegalStateException: File has been moved - cannot be read again"

    在做文件上传时,当写入上传的文件到文件时,会报错“java.lang.IllegalStateException: File has been moved - cannot be read again ...

  5. hive启动时报错 java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D at org.apache.hadoop.fs.Path.initialize

    错误提示信息如下 错误信息如下 [root@node1 bin]# ./hive Logging initialized -bin/lib/hive-common-.jar!/hive-log4j.p ...

  6. 运用反射时报错java.lang.NoSuchMethodException,以解决,记录一下

    问题:想调用service类中的私有方法时, Method target=clz.getMethod("say", String.class);用Class的getMethod报错 ...

  7. storm supervisor启动报错java.lang.RuntimeException: java.io.EOFException

    storm因机器断电或其他异常导致的supervisor意外终止,再次启动时报错: 1. 2013-09-24 09:15:44,361 INFO [main] daemon.supervisor ( ...

  8. Android Studio 首次安装报错 Java.lang.RuntimeException:java.lang.NullPointerException...错

    下次安装报:Java.lang.RuntimeException: java.lang.NullPointerException......错 只需在文件..\Android Studio\bin\i ...

  9. 我的Android进阶之旅------>Android中MediaRecorder.stop()报错 java.lang.RuntimeException: stop failed.【转】

    本文转载自:http://blog.csdn.net/ouyang_peng/article/details/48048975 今天在调用MediaRecorder.stop(),报错了,java.l ...

  10. maven报错 java.lang.RuntimeException: com.google.inject.CreationException: Unable to create injector, see the following errors

    2 errors java.lang.RuntimeException: com.google.inject.CreationException: Unable to create injector, ...

随机推荐

  1. 你以为这是MacOS ,其实这是我的 Linux 系统 Manjaro!

    对于如何将你的 Manjaro 系统美化成 MacOS 你需要做以下几件事情: 1.安装 WhiteSur-Gtk-theme 主题. 2.安装 Plank 软件. 3.安装 vala-panel-a ...

  2. 【译】 双向流传输和ASP.NET Core 3.0上的gRPC简介

    原文相关 原文作者:Eduard Los 原文地址:https://medium.com/@eddyf1xxxer/bi-directional-streaming-and-introduction- ...

  3. 2020年了,再不会webpack敲得代码就不香了(近万字实战)

    https://zhuanlan.zhihu.com/p/99959392?utm_source=wechat_session&utm_medium=social&utm_oi=619 ...

  4. vue结合element-ui实现多层复选框checkbox

    1.需求如上图所以: html相关代码如下: 1 <div class="intent-course-wrapper"> 2 <div class="c ...

  5. SpringBoot 集成短信和邮件

    准备工作 1.集成邮件 以QQ邮箱为例 在发送邮件之前,要开启POP3和SMTP协议,需要获得邮件服务器的授权码,获取授权码: 1.设置>账户 在账户的下面有一个开启SMTP协议的开关并进行密码 ...

  6. 机器学习-无监督机器学习-SVD奇异值分解-24

    [POC] 1. 奇异值分解的本质 特征值分解只能够对于方阵提取重要特征, Ax=λx λ为特征值 x为对应的特征向量 奇异值分解可以对于任意矩阵: 注意看中间的矩阵是一个对角矩阵,颜色越深越起作用- ...

  7. spring--CGLIB动态代理的实现原理

    CGLIB(Code Generation Library)是一个强大的.高性能.高质量的代码生成库,它可以在运行时扩展 Java 类和实现 Java 接口.CGLIB 动态代理是基于继承的方式来实现 ...

  8. Python-函数-算术函数

    #python-函数-算术函数 #(1)加减乘除 #加法 add(),减法 subtract(),乘法 multiply(),除法 divide() #作用:数组间的加减乘除 import numpy ...

  9. 一个WPF开发的打印对话框-PrintDialogX

    今天五月一号,大家玩的开心哦. 1. 介绍 今天介绍一个WPF开发的打印对话框开源项目-PrintDialogX,该开源项目由<WPF开源项目:AIStudio.Wpf.AClient>作 ...

  10. Vue事件方法中this.属性名

    vue事件方法中访问data对象中的成员 : this.属性名 注意: 如果事件处理代码没有写到methods中,而是写在行内则不需要this.