1. Scenario description

When I used Sqoop to import a MySQL table into Hive, I got the following error:

// :: WARN hcat.SqoopHCatUtilities: The Sqoop job can fail if types are not  assignment compatible
// :: WARN hcat.SqoopHCatUtilities: The HCatalog field submername has type string. Expected = varchar based on database column type : VARCHAR
// :: WARN hcat.SqoopHCatUtilities: The Sqoop job can fail if types are not assignment compatible
// :: INFO mapreduce.DataDrivenImportJob: Configuring mapper for HCatalog import job
// :: INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
// :: INFO client.RMProxy: Connecting to ResourceManager at hadoop-namenode01/192.168.1.101:
// :: WARN conf.HiveConf: HiveConf of name hive.server2.webui.host.port does not exist
// :: INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
// :: INFO db.DBInputFormat: Using read commited transaction isolation
// :: INFO mapreduce.JobSubmitter: number of splits:
// :: INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1562229385371_50086
// :: INFO impl.YarnClientImpl: Submitted application application_1562229385371_50086
// :: INFO mapreduce.Job: The url to track the job: http://hadoop-namenode01:8088/proxy/application_1562229385371_50086/
// :: INFO mapreduce.Job: Running job: job_1562229385371_50086
// :: INFO hive.metastore: Closed a connection to metastore, current connections:
// :: INFO mapreduce.Job: Job job_1562229385371_50086 running in uber mode : false
// :: INFO mapreduce.Job: map % reduce %
// :: INFO mapreduce.Job: Task Id : attempt_1562229385371_50086_m_000000_0, Status : FAILED
Error: GC overhead limit exceeded

Why does the Sqoop import throw this exception?
By default, the MySQL JDBC driver (not Sqoop itself) fetches the entire result set in one shot and buffers it in the map task's memory. On a large table this exhausts the heap and triggers the error. To avoid it, tell the driver to stream the data in batches: appending "?dontTrackOpenResources=true&defaultFetchSize=10000&useCursorFetch=true" to the JDBC connection string makes it fetch 10000 rows per batch.
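Before rerunning the full import, you can sanity-check the streaming URL with sqoop eval, which runs an arbitrary query through the same JDBC connection. This is a minimal sketch, assuming the same host, credentials, and table as the import script below:

#!/bin/bash
# Hedged check: run a lightweight query through the same JDBC URL the import
# will use, so URL typos or unsupported parameters fail fast here.
sqoop eval \
--connect "jdbc:mysql://lenmom-mysql:3306/inventory?dontTrackOpenResources=true&defaultFetchSize=10000&useCursorFetch=true" \
--username root --password root \
--query "SELECT * FROM order_detail LIMIT 10"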

The script I use to import is as follows:

File: sqoop_order_detail.sh

#!/bin/bash

/home/lenmom/sqoop-1.4./bin/sqoop import \
--connect jdbc:mysql://lenmom-mysql:3306/inventory \
--username root --password root \
--driver com.mysql.jdbc.Driver \
--table order_detail \
--hcatalog-database orc \
--hcatalog-table order_detail \
--hcatalog-partition-keys pt_log_d \
--hcatalog-partition-values \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
-m

The target MySQL table has 10 billion records.

2. Solutions

2.1 Solution 1

Modify the MySQL JDBC URL to enable streaming reads by appending the following parameters:

?dontTrackOpenResources=true&defaultFetchSize=10000&useCursorFetch=true

The defaultFetchSize value can be adjusted to fit your workload. In my case, the whole script is:

#!/bin/bash

/home/lenmom/sqoop-1.4./bin/sqoop import \
--connect jdbc:mysql://lenmom-mysql:3306/inventory?dontTrackOpenResources=true\&defaultFetchSize=10000\&useCursorFetch=true\&useUnicode=yes\&characterEncoding=utf8 \
--username root --password root \
--driver com.mysql.jdbc.Driver \
--table order_detail \
--hcatalog-database orc \
--hcatalog-table order_detail \
--hcatalog-partition-keys pt_log_d \
--hcatalog-partition-values \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
-m

Don't forget to escape each & in the shell script; alternatively, wrap the whole JDBC URL in double quotes instead of escaping:

#!/bin/bash

/home/lenmom/sqoop-1.4./bin/sqoop import \
--connect "jdbc:mysql://lenmom-mysql:3306/inventory?dontTrackOpenResources=true&defaultFetchSize=10000&useCursorFetch=true&useUnicode=yes&characterEncoding=utf8&characterEncoding=utf8" \
--username root --password root \
--driver com.mysql.jdbc.Driver \
--table order_detail \
--hcatalog-database orc \
--hcatalog-table order_detail \
--hcatalog-partition-keys pt_log_d \
--hcatalog-partition-values \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
-m

2.2 Solution 2

sqoop import -Dmapreduce.map.memory.mb= -Dmapreduce.map.java.opts=-Xmx1600m -Dmapreduce.task.io.sort.mb=

The above parameters need to be tuned to the data volume for a successful Sqoop pull.
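For reference, here is a minimal sketch with illustrative numbers (a 2048 MB map container, a 1600 MB heap, and a 400 MB sort buffer; these values are assumptions, not from the original run). A common rule of thumb is to keep -Xmx at roughly 80% of mapreduce.map.memory.mb:

#!/bin/bash
# Illustrative sizing only: 2048 MB container, 1600 MB heap (~80% of the
# container), 400 MB sort buffer. Tune all three to your data volume.
sqoop import \
-Dmapreduce.map.memory.mb=2048 \
-Dmapreduce.map.java.opts=-Xmx1600m \
-Dmapreduce.task.io.sort.mb=400 \
--connect "jdbc:mysql://lenmom-mysql:3306/inventory?dontTrackOpenResources=true&defaultFetchSize=10000&useCursorFetch=true" \
--username root --password root \
--driver com.mysql.jdbc.Driver \
--table order_detail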

2.3 Solution 3

Increase the number of mappers (the default is 4; it should not exceed the number of DataNodes). For a saved Sqoop job:

sqoop job --exec lenmom-job -- --num-mappers
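For a one-off import, the equivalent looks like the sketch below. The mapper count of 8 and the split column order_id are illustrative assumptions; choose a count no greater than your DataNode count, and point --split-by at an indexed, evenly distributed column or the splits will be skewed:

#!/bin/bash
# Hedged sketch: 8 mappers and the split column order_id are assumptions.
# Without an evenly distributed --split-by column, one mapper can still
# receive most of the rows.
sqoop import \
--connect "jdbc:mysql://lenmom-mysql:3306/inventory?dontTrackOpenResources=true&defaultFetchSize=10000&useCursorFetch=true" \
--username root --password root \
--driver com.mysql.jdbc.Driver \
--table order_detail \
--split-by order_id \
--num-mappers 8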

Reference:

https://stackoverflow.com/questions/26484873/cloudera-settings-sqoop-import-gives-java-heap-space-error-and-gc-overhead-limit
