1. Scenario description

When I used Sqoop to import a MySQL table into Hive, I got the following error:

// :: WARN hcat.SqoopHCatUtilities: The Sqoop job can fail if types are not  assignment compatible
// :: WARN hcat.SqoopHCatUtilities: The HCatalog field submername has type string. Expected = varchar based on database column type : VARCHAR
// :: WARN hcat.SqoopHCatUtilities: The Sqoop job can fail if types are not assignment compatible
// :: INFO mapreduce.DataDrivenImportJob: Configuring mapper for HCatalog import job
// :: INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
// :: INFO client.RMProxy: Connecting to ResourceManager at hadoop-namenode01/192.168.1.101:
// :: WARN conf.HiveConf: HiveConf of name hive.server2.webui.host.port does not exist
// :: INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
// :: INFO db.DBInputFormat: Using read commited transaction isolation
// :: INFO mapreduce.JobSubmitter: number of splits:
// :: INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1562229385371_50086
// :: INFO impl.YarnClientImpl: Submitted application application_1562229385371_50086
// :: INFO mapreduce.Job: The url to track the job: http://hadoop-namenode01:8088/proxy/application_1562229385371_50086/
// :: INFO mapreduce.Job: Running job: job_1562229385371_50086
// :: INFO hive.metastore: Closed a connection to metastore, current connections:
// :: INFO mapreduce.Job: Job job_1562229385371_50086 running in uber mode : false
// :: INFO mapreduce.Job: map % reduce %
// :: INFO mapreduce.Job: Task Id : attempt_1562229385371_50086_m_000000_0, Status : FAILED
Error: GC overhead limit exceeded

Why does the Sqoop import throw this exception?
The answer is: during the import, the RDBMS (not Sqoop) fetches all the rows in one shot and tries to load everything into memory. This exhausts the mapper's heap and triggers the error. To overcome it, you need to tell the RDBMS to return the data in batches. Appending the parameters “?dontTrackOpenResources=true&defaultFetchSize=10000&useCursorFetch=true” to the JDBC connection string tells the database driver to fetch 10000 rows per batch.

The script I use to import is as follows:

file sqoop_order_detail.sh

#!/bin/bash

/home/lenmom/sqoop-1.4./bin/sqoop import \
--connect jdbc:mysql://lenmom-mysql:3306/inventory \
--username root --password root \
--driver com.mysql.jdbc.Driver \
--table order_detail \
--hcatalog-database orc \
--hcatalog-table order_detail \
--hcatalog-partition-keys pt_log_d \
--hcatalog-partition-values \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
-m

The target MySQL table has 10 billion records.
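To get a rough idea of the table size before tuning, you can query information_schema. This is only a sketch: it assumes the mysql command-line client can reach the source host with the same credentials, and table_rows is just an estimate for InnoDB tables.

# Rough size check of the source table; the row count is an estimate for InnoDB,
# but good enough to pick a fetch size and mapper count.
mysql -h lenmom-mysql -u root -proot -e \
"SELECT table_rows, ROUND(data_length/1024/1024/1024, 1) AS data_gb
 FROM information_schema.tables
 WHERE table_schema = 'inventory' AND table_name = 'order_detail';"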

2. Solutions

2.1 Solution 1

Modify the MySQL JDBC URL so that the driver streams data in batches, by appending the following parameters:

?dontTrackOpenResources=true&defaultFetchSize=10000&useCursorFetch=true

The defaultFetchSize value can be adjusted to your environment. In my case, the whole script is:

#!/bin/bash

/home/lenmom/sqoop-1.4./bin/sqoop import \
--connect jdbc:mysql://lenmom-mysql:3306/inventory?dontTrackOpenResources=true\&defaultFetchSize=10000\&useCursorFetch=true\&useUnicode=yes\&characterEncoding=utf8 \
--username root --password root \
--driver com.mysql.jdbc.Driver \
--table order_detail \
--hcatalog-database orc \
--hcatalog-table order_detail \
--hcatalog-partition-keys pt_log_d \
--hcatalog-partition-values \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
-m

Don't forget to escape the & characters in the shell script; alternatively, wrap the whole JDBC URL in double quotes instead of escaping each &:

#!/bin/bash

/home/lenmom/sqoop-1.4./bin/sqoop import \
--connect "jdbc:mysql://lenmom-mysql:3306/inventory?dontTrackOpenResources=true&defaultFetchSize=10000&useCursorFetch=true&useUnicode=yes&characterEncoding=utf8&characterEncoding=utf8" \
--username root --password root \
--driver com.mysql.jdbc.Driver \
--table order_detail \
--hcatalog-database orc \
--hcatalog-table order_detail \
--hcatalog-partition-keys pt_log_d \
--hcatalog-partition-values \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
-m
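Before re-running the full import, a quick sanity check can confirm that the driver accepts the extra URL parameters. This is only a sketch (it assumes sqoop is on the PATH and reuses the host, credentials and table from the scripts above); sqoop eval runs a single query, so it does not exercise the cursor-fetch path the way the mappers do, but it catches typos in the connection string early.

# Sanity check only: run one small query through the tuned JDBC URL.
sqoop eval \
--connect "jdbc:mysql://lenmom-mysql:3306/inventory?dontTrackOpenResources=true&defaultFetchSize=10000&useCursorFetch=true" \
--username root --password root \
--query "SELECT * FROM order_detail LIMIT 5"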

2.2 Solution 2

sqoop import -Dmapreduce.map.memory.mb= -Dmapreduce.map.java.opts=-Xmx1600m -Dmapreduce.task.io.sort.mb=

The above parameters need to be tuned according to the data volume for the Sqoop pull to succeed; a filled-in sketch follows.
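As a concrete sketch combined with the Solution 1 connection string: the numbers here are illustrative (a 2048 MB map container, the 1600m heap from the command above, a 512 MB sort buffer), and the partition value and mapper count are placeholders; tune all of them to your cluster and data.

#!/bin/bash
# Sketch only: memory settings, the partition value 20190101 and the mapper
# count 4 are illustrative placeholders; assumes sqoop is on the PATH.
sqoop import \
-Dmapreduce.map.memory.mb=2048 \
-Dmapreduce.map.java.opts=-Xmx1600m \
-Dmapreduce.task.io.sort.mb=512 \
--connect "jdbc:mysql://lenmom-mysql:3306/inventory?dontTrackOpenResources=true&defaultFetchSize=10000&useCursorFetch=true" \
--username root --password root \
--driver com.mysql.jdbc.Driver \
--table order_detail \
--hcatalog-database orc \
--hcatalog-table order_detail \
--hcatalog-partition-keys pt_log_d \
--hcatalog-partition-values 20190101 \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
-m 4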

2.3 Solution 3

Increase the number of mappers (the default is 4; it should not be greater than the number of DataNodes):

sqoop job --exec lenmom-job -- --num-mappers ;
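Filled in with an illustrative value, the command looks like the sketch below; note that more mappers only help when Sqoop can split the table on a numeric column (the primary key by default, or one supplied with --split-by).

# 8 is a placeholder; keep the mapper count at or below the number of DataNodes.
sqoop job --exec lenmom-job -- --num-mappers 8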

Reference:

https://stackoverflow.com/questions/26484873/cloudera-settings-sqoop-import-gives-java-heap-space-error-and-gc-overhead-limit
