【原创】大叔问题定位分享(17)spark查orc格式数据偶尔报错NullPointerException
spark查orc格式的数据有时会报这个错
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
... 47 more
跟进代码
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
static enum SplitStrategyKind {
HYBRID,
BI,
ETL
}
... Context(Configuration conf) {
this.conf = conf;
minSize = conf.getLong(MIN_SPLIT_SIZE, DEFAULT_MIN_SPLIT_SIZE);
maxSize = conf.getLong(MAX_SPLIT_SIZE, DEFAULT_MAX_SPLIT_SIZE);
String ss = conf.get(ConfVars.HIVE_ORC_SPLIT_STRATEGY.varname);
if (ss == null || ss.equals(SplitStrategyKind.HYBRID.name())) {
splitStrategyKind = SplitStrategyKind.HYBRID;
} else {
LOG.info("Enforcing " + ss + " ORC split strategy");
splitStrategyKind = SplitStrategyKind.valueOf(ss);
} ...
switch(context.splitStrategyKind) {
case BI:
// BI strategy requested through config
splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal,
deltas, covered);
break;
case ETL:
// ETL strategy requested through config
splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal,
deltas, covered);
break;
default:
// HYBRID strategy
if (avgFileSize > context.maxSize) {
splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);
} else {
splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);
}
break;
}
org.apache.hadoop.hive.conf.HiveConf.ConfVars
HIVE_ORC_SPLIT_STRATEGY("hive.exec.orc.split.strategy", "HYBRID", new StringSet("HYBRID", "BI", "ETL"),
"This is not a user level config. BI strategy is used when the requirement is to spend less time in split generation" +
" as opposed to query execution (split generation does not read or cache file footers)." +
" ETL strategy is used when spending little more time in split generation is acceptable" +
" (split generation reads and caches file footers). HYBRID chooses between the above strategies" +
" based on heuristics."),
The HYBRID mode reads the footers for all files if there are fewer files than expected mapper count, switching over to generating 1 split per file if the average file sizes are smaller than the default HDFS blocksize. ETL strategy always reads the ORC footers before generating splits, while the BI strategy generates per-file splits fast without reading any data from HDFS.
可见hive.exec.orc.split.strategy默认是HYBRID,HYBRID时如果不满足
if (avgFileSize > context.maxSize) {
则
splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);
报错的就是BISplitStrategy,具体这个类为什么报错还没有细看,不过可以修改设置避免这个问题
set hive.exec.orc.split.strategy=ETL
问题暂时解决,未完待续;
【原创】大叔问题定位分享(17)spark查orc格式数据偶尔报错NullPointerException的更多相关文章
- 【原创】大叔问题定位分享(24)hbase standalone方式启动报错
hbase 2.0.2 hbase standalone方式启动报错: 2019-01-17 15:49:08,730 ERROR [Thread-24] master.HMaster: Failed ...
- 【原创】大叔问题定位分享(2)spark任务一定几率报错java.lang.NoSuchFieldError: HIVE_MOVE_FILES_THREAD_COUNT
最近用yarn cluster方式提交spark任务时,有时会报错,报错几率是40%,报错如下: 18/03/15 21:50:36 116 ERROR ApplicationMaster91: Us ...
- 【原创】大叔问题定位分享(16)spark写数据到hive外部表报错ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat
spark 2.1.1 spark在写数据到hive外部表(底层数据在hbase中)时会报错 Caused by: java.lang.ClassCastException: org.apache.h ...
- 【原创】大叔问题定位分享(15)spark写parquet数据报错ParquetEncodingException: empty fields are illegal, the field should be ommited completely instead
spark 2.1.1 spark里执行sql报错 insert overwrite table test_parquet_table select * from dummy 报错如下: org.ap ...
- 【原创】大叔问题定位分享(10)提交spark任务偶尔报错 org.apache.spark.SparkException: A master URL must be set in your configuration
spark 2.1.1 一 问题重现 问题代码示例 object MethodPositionTest { val sparkConf = new SparkConf().setAppName(&qu ...
- 【原创】大叔问题定位分享(9)oozie提交spark任务报 java.lang.NoClassDefFoundError: org/apache/kafka/clients/producer/KafkaProducer
oozie中支持很多的action类型,比如spark.hive,对应的标签为: <spark xmlns="uri:oozie:spark-action:0.1"> ...
- 【原创】大叔问题定位分享(8)提交spark任务报错 Caused by: java.lang.ClassNotFoundException: org.I0Itec.zkclient.exception.ZkNoNodeException
spark 2.1.1 一 问题重现 spark-submit --master local[*] --class app.package.AppClass --jars /jarpath/zkcli ...
- 【原创】大叔问题定位分享(29)datanode启动报错:50020端口被占用
集群中有一台datanode一直启动报错如下: java.net.BindException: Problem binding to [$server1:50020] java.net.BindExc ...
- 【原创】大叔问题定位分享(13)HBase Region频繁下线
问题现象:hive执行sql报错 select count(*) from test_hive_table; 报错 Error: java.io.IOException: org.apache.had ...
随机推荐
- すぬけ君の塗り絵 / Snuke's Coloring AtCoder - 2068 (思维,排序,贡献)
Problem Statement We have a grid with H rows and W columns. At first, all cells were painted white. ...
- PHP获取项目所有控制器方法名称
PHP获取项目所有控制器方法名称 //获取模块下所有的控制器和方法写入到权限表 public function initperm() { $modules = array('admin'); //模块 ...
- CGPoint、CGSize、CGRect、CGRectEdge的详细使用
http://blog.sina.com.cn/s/blog_953e22700101r7lz.html 在CGGeometry.h里的 CGPoint.CGSize.CGRect.CGRectEdg ...
- js01-javascript语法标准和数据类型
语法规则 (1)JavaScript对换行.缩进.空格不敏感. 备注:每一条语句末尾要加上分号,虽然分号不是必须加的,但是为了程序今后要压缩,如果不加分号,压缩之后将不能运行. (2)所有的符号,都是 ...
- vue非父子组件之间的通信
https://www.cnblogs.com/chengduxiaoc/p/7099552.html //vm.$emit( event, arg ) //触发当前实例上的事件 //vm.$on ...
- python的基本流程控制
一:if判断语句 1.1 if判断语法之一 if条件: 子代码块 1.2 if判断语法之二 if条件: 子代码块 else: 子代码块 1.3 if判断语法之三 if条件: if条件: 子代码块 1. ...
- Spring MVC 使用介绍(九)—— 异常处理
一.概述 Spring MVC异常处理功能的作用为:捕捉处理器的异常,并映射到相应视图 有4种方式: SimpleMappingExceptionResolver:通过配置的方式实现异常处理,该方式简 ...
- 深入理解JVM(5)——垃圾收集和内存分配策略
1.垃圾收集对象 垃圾收集主要是针对堆和方法区进行. 程序计数器.虚拟机栈和本地方法栈这三个区域属于线程私有的,只存在于线程的生命周期内,线程结束之后也会消失,因此不需要对这三个区域进行垃圾回收. 哪 ...
- GIT-常规操作
本地安装git, 安装文件: Git客户端: 可百度搜索:GIT64位或GIT32位等关键字找到相应的版本进行下载. 本地地址:D:\20-git\Git-2.20.1-64-bit.exe 也可百度 ...
- 关于微信登录授权获取unionid的方法
前言:微信登录授权是目前普遍存在于小程序的,还有一种静默授权方式是微信提供的但是不推荐使用,由于不同设备登录openid是不同的那么我们应该怎样拿到一个唯一的ID呢,下面做分享 wxml代码 < ...