ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs:...
: jdbc:hive2://master01.hadoop.dtmobile.cn:1> select * from cell_random_grid_tmp2 limit 1;
INFO  : Compiling command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5):
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:grid_row_id, type:,), comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5); Time taken: 0.045 seconds
INFO  : Executing command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5):
INFO  : Completed executing command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5); Time taken: 0.001 seconds
INFO  : OK
Error: java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://master01.hadoop.dtmobile.cn:8020/user/hive/warehouse/capacity.db/cell_random_grid_tmp2/part-00000-82a689a5-7c2a-48a0-ab17-8bf04c963ea6-c000.snappy.parquet (state=,code=0)
: jdbc:hive2://master01.hadoop.dtmobile.cn:1>
The data was written to Hive with Spark 2.3 through Spark SQL's saveAsTable(); by default, Spark SQL saves Hive tables as Snappy-compressed Parquet. Once the save completed, querying the table through Hive beeline failed with the error above, while the same query through Spark ran fine.
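For context, here is a minimal sketch of the kind of write involved. The column names, the decimal precision, and the app name are illustrative assumptions, not taken from the actual job; a decimal column is the usual trigger because its physical Parquet type differs between the standard and legacy conventions.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("write-grid-table")          // illustrative app name
  .enableHiveSupport()
  .getOrCreate()

// Illustrative schema: in the standard convention, DECIMAL(20,4) is stored
// as fixed_len_byte_array, while smaller precisions map to int32/int64.
val schema = StructType(Seq(
  StructField("grid_row_id", LongType),
  StructField("weight", DecimalType(20, 4))
))

val rows = spark.sparkContext.parallelize(Seq(
  Row(1L, BigDecimal("1234.5678"))
))

val df = spark.createDataFrame(rows, schema)

// With the default spark.sql.parquet.writeLegacyFormat=false this writes
// standard-convention Parquet, which older Hive Parquet readers may reject.
df.write.mode("overwrite").saveAsTable("capacity.cell_random_grid_tmp2")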
The same problem turned up on Stack Overflow.
The root cause is as follows:
This issue is caused by the different Parquet conventions used in Hive and Spark. In Hive, the decimal datatype is represented as fixed bytes (INT 32). In Spark 1.4 or later, the default convention is to use the standard Parquet representation for the decimal data type. Under the standard Parquet representation, the underlying representation changes based on the precision of the column datatype.
e.g. DECIMAL can be used to annotate the following types:
  int32: for 1 <= precision <= 9
  int64: for 1 <= precision <= 18; precision < 10 will produce a warning
Hence this issue happens only with datatypes that have different representations in the different Parquet conventions. If the datatype is DECIMAL(10,3), both conventions represent it as INT32, so we won't face an issue. If you are not aware of the internal representation of the datatypes, it is safe to read with the same convention that was used for writing. With Hive, you do not have the flexibility to choose the Parquet convention; with Spark, you do.
Solution: The convention Spark uses to write Parquet data is configurable, determined by the property spark.sql.parquet.writeLegacyFormat. The default value is false. If set to "true", Spark will use the same convention as Hive for writing the Parquet data, which solves the issue.
So we set spark.sql.parquet.writeLegacyFormat = true, and the problem was solved.
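As a sketch of how the flag can be applied (the app name and the df value are carried over from the illustrative example above), it can be set either when the session is built or on a live session before the write runs:

// Option 1: set the flag when the session is built
val spark = SparkSession.builder()
  .appName("write-grid-table-legacy")   // illustrative app name
  .config("spark.sql.parquet.writeLegacyFormat", "true")
  .enableHiveSupport()
  .getOrCreate()

// Option 2: set it on an existing session, before the write executes
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

// The write itself is unchanged; only the on-disk decimal layout differs.
df.write.mode("overwrite").saveAsTable("capacity.cell_random_grid_tmp2")

The property can equally be passed at submit time with --conf spark.sql.parquet.writeLegacyFormat=true. Note that it only affects files written after it is set; files already on disk keep their original layout, so an existing table may need to be rewritten.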
Looking this parameter (spark.sql.parquet.writeLegacyFormat) up in the Spark 2.3 source code:
In package org.apache.spark.sql.internal, SQLConf.scala, which holds the Spark SQL defaults, describes it as follows:
val PARQUET_WRITE_LEGACY_FORMAT = buildConf("spark.sql.parquet.writeLegacyFormat")
  .doc("Whether to be compatible with the legacy Parquet format adopted by Spark 1.4 and prior " +
    "versions, when converting Parquet schema to Spark SQL schema and vice versa.")
  .booleanConf
  .createWithDefault(false)
As shown, the default value is false.
In package org.apache.spark.sql.execution.datasources.parquet, the class comment on ParquetWriteSupport.scala reads:
/**
 * A Parquet [[WriteSupport]] implementation that writes Catalyst [[InternalRow]]s as Parquet
 * messages. This class can write Parquet data in two modes:
 *
 *  - Standard mode: Parquet data are written in standard format defined in parquet-format spec.
 *  - Legacy mode: Parquet data are written in legacy format compatible with Spark 1.4 and prior.
 *
 * This behavior can be controlled by SQL option `spark.sql.parquet.writeLegacyFormat`. The value
 * of this option is propagated to this class by the `init()` method and its Hadoop configuration
 * argument.
 */
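If it is unclear which convention an existing file was written with, one way to check is to print the file's Parquet schema and inspect the physical type backing each decimal column. This is a sketch that assumes the parquet-hadoop classes bundled with Spark 2.3 are on the classpath (ParquetFileReader.readFooter is deprecated in later Parquet releases); the path is the one from the error above:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader

val conf = new Configuration()
val path = new Path("hdfs://master01.hadoop.dtmobile.cn:8020/user/hive/warehouse/" +
  "capacity.db/cell_random_grid_tmp2/part-00000-82a689a5-7c2a-48a0-ab17-8bf04c963ea6-c000.snappy.parquet")

// Read only the footer metadata, not the data pages.
val footer = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER)

// Legacy-convention decimals appear as fixed_len_byte_array; standard-convention
// decimals with small precision appear as int32 / int64.
println(footer.getFileMetaData.getSchema)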