ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs:...
```
: jdbc:hive2://master01.hadoop.dtmobile.cn:1> select * from cell_random_grid_tmp2 limit 1;
INFO  : Compiling command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5):
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:grid_row_id, type:,), comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5); Time taken: 0.045 seconds
INFO  : Executing command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5):
INFO  : Completed executing command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5); Time taken: 0.001 seconds
INFO  : OK
Error: java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://master01.hadoop.dtmobile.cn:8020/user/hive/warehouse/capacity.db/cell_random_grid_tmp2/part-00000-82a689a5-7c2a-48a0-ab17-8bf04c963ea6-c000.snappy.parquet (state=,code=0)
: jdbc:hive2://master01.hadoop.dtmobile.cn:1>
```
The data was written to Hive with Spark 2.3 Spark SQL via saveAsTable(). When Spark SQL writes to Hive this way, it saves the data as Snappy-compressed Parquet by default. After the write completed, querying the table through Hive beeline failed with the error above, while the same query through Spark ran fine.
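For context, a minimal sketch of the kind of write involved. The table and column names are taken from the error log above; the decimal type and its precision are assumptions for illustration, since decimals are what trigger the convention mismatch discussed below:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("write-hive-parquet")
  .enableHiveSupport()
  .getOrCreate()

// A decimal column: one of the types whose physical Parquet encoding
// differs between Spark's standard convention and Hive's.
val df = spark.sql(
  "SELECT CAST(id AS DECIMAL(18,4)) AS grid_row_id FROM range(10)")

// In Spark 2.3, saveAsTable() writes Snappy-compressed Parquet by default.
df.write.mode("overwrite").saveAsTable("capacity.cell_random_grid_tmp2")
```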
The same problem turned up on Stack Overflow. The root cause given there is as follows:
This issue is caused by the different Parquet conventions used in Hive and Spark. In Hive, the decimal datatype is represented as fixed bytes (INT 32). In Spark 1.4 and later, the default convention is the standard Parquet representation for the decimal data type. Under the standard Parquet representation, the underlying physical type changes with the precision of the column.
e.g. DECIMAL can be used to annotate the following types:
- int32: for 1 <= precision <= 9
- int64: for 1 <= precision <= 18; precision < 10 will produce a warning
Hence this issue happens only with datatypes that have different representations under the different Parquet conventions. If the datatype is DECIMAL(10,3), both conventions represent it as INT32, so we won't face an issue. If you are not aware of the internal representation of the datatypes, it is safe to read with the same convention that was used for writing. With Hive, you do not have the flexibility to choose the Parquet convention; with Spark, you do.
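One way to check which convention a given file actually uses is to dump its physical Parquet schema from the file footer. Here is a sketch using the parquet-hadoop classes bundled with Spark 2.3 (runnable from spark-shell; the path is the file from the error above, and readFooter is assumed to be available in the bundled parquet-mr version):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

// Read only the footer metadata and print the physical schema, which
// shows whether a decimal column is stored as int32/int64 (standard
// convention) or as a fixed-length byte array (legacy/Hive convention).
val footer = ParquetFileReader.readFooter(
  new Configuration(),
  new Path("hdfs://master01.hadoop.dtmobile.cn:8020/user/hive/warehouse/" +
    "capacity.db/cell_random_grid_tmp2/" +
    "part-00000-82a689a5-7c2a-48a0-ab17-8bf04c963ea6-c000.snappy.parquet"))
println(footer.getFileMetaData.getSchema)
```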
Solution: the convention Spark uses to write Parquet data is configurable, determined by the property spark.sql.parquet.writeLegacyFormat. Its default value is false. If it is set to true, Spark will use the same convention as Hive when writing Parquet data, which resolves the issue.
So I set spark.sql.parquet.writeLegacyFormat = true, and the problem was solved.
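As a sketch, the flag can be set when the session is built, or equivalently passed to spark-submit as --conf spark.sql.parquet.writeLegacyFormat=true:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-compatible-parquet")
  .enableHiveSupport()
  // Write decimals (and other affected types) in the legacy,
  // Hive-compatible Parquet representation.
  .config("spark.sql.parquet.writeLegacyFormat", "true")
  .getOrCreate()
```

Note that the option only affects files written after it takes effect; a table already written in the standard format has to be rewritten before Hive can read it.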
Looking this parameter (spark.sql.parquet.writeLegacyFormat) up in the Spark 2.3 source code:
In package org.apache.spark.sql.internal, the default Spark SQL configuration in SQLConf.scala describes it as follows:
```scala
val PARQUET_WRITE_LEGACY_FORMAT = buildConf("spark.sql.parquet.writeLegacyFormat")
  .doc("Whether to be compatible with the legacy Parquet format adopted by Spark 1.4 and prior " +
    "versions, when converting Parquet schema to Spark SQL schema and vice versa.")
  .booleanConf
  .createWithDefault(false)
```
As you can see, the default value is false.
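Because it is an ordinary SQL conf, it can also be inspected and flipped on a live session, e.g. in spark-shell:

```scala
// Prints "false" on a stock session, matching createWithDefault(false).
println(spark.conf.get("spark.sql.parquet.writeLegacyFormat"))

// Takes effect for writes issued after this point in the same session.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
```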
ParquetWriteSupport.scala, in package org.apache.spark.sql.execution.datasources.parquet, is documented as follows:
```scala
/**
 * A Parquet [[WriteSupport]] implementation that writes Catalyst [[InternalRow]]s as Parquet
 * messages. This class can write Parquet data in two modes:
 *
 *  - Standard mode: Parquet data are written in standard format defined in parquet-format spec.
 *  - Legacy mode: Parquet data are written in legacy format compatible with Spark 1.4 and prior.
 *
 * This behavior can be controlled by SQL option `spark.sql.parquet.writeLegacyFormat`. The value
 * of this option is propagated to this class by the `init()` method and its Hadoop configuration
 * argument.
 */
```