DataX Tuning

Tags (space-separated): ETL


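I. DataX tuning directions

DataX tuning can be broken into several parts (note: "task machine" below means the machine that runs the DataX job).

1. The impact of hardware factors such as the network's own bandwidth;

2. DataX's own parameters;

3. The path from the source to the task machine;

4. The path from the task machine to the destination.

In other words, when DataX transfers feel slow, start troubleshooting from these four areas.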

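1. Tuning hardware factors such as network bandwidth

This part is mainly about understanding the network itself: what the bandwidth from source to destination really is (worked out with the actual-bandwidth formula rather than the nominal figure), and how heavily the link is normally used, so you can judge whether the network is what makes the transfer slow. A few approaches follow.

1. Use scp from source to destination, a Python HTTP server, nethogs, and similar tools to observe the real network and NIC throughput;

2. Use monitoring to check how busy the network is overall during the window in which the job runs, and decide whether the job should be moved away from the network peak;

3. Watch the load on the task machine, especially network and disk I/O, to see whether it has become the bottleneck and is slowing things down.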
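As a rough aid for the bandwidth check above, the sketch below converts a nominal link speed into usable throughput and a best-case transfer time. The function names and the sample numbers are hypothetical, and the model ignores protocol overhead, so treat the result as a lower bound and not a measurement.

```python
def effective_mb_per_s(link_mbps: float, utilization: float = 0.0) -> float:
    """Convert a nominal link speed (megabits/s) into usable megabytes/s.

    utilization is the share of the link already taken by other traffic,
    expressed as a fraction in [0, 1).
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    # 8 bits per byte, then keep only the share of the link that is free.
    return link_mbps / 8.0 * (1.0 - utilization)


def transfer_seconds(size_bytes: int, link_mbps: float, utilization: float = 0.0) -> float:
    """Best-case time to move size_bytes over the link (no protocol overhead)."""
    bytes_per_s = effective_mb_per_s(link_mbps, utilization) * 1_000_000
    return size_bytes / bytes_per_s


if __name__ == "__main__":
    # Hypothetical numbers: a 100 Mbps link that is already 20% busy, 10 GB of data.
    print(effective_mb_per_s(100, 0.2))
    print(transfer_seconds(10 * 1_000_000_000, 100, 0.2))
```

If the speed DataX reports is already close to this figure, the network is the likely limit and tuning DataX parameters will not help much.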

2. Tuning DataX's own parameters

Global

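{
"core":{
"transport":{
"channel":{
"speed":{
"channel": 2, ## concurrency of the data import; tune it to the server hardware
"record":-1, ## removes the limit on the number of records read
"byte":-1, ## removes the limit on bytes
"batchSize":2048 ## size of each batch read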
}
}
}
},
"job":{
...
}
}

Local

"setting": {
"speed": {
"channel": 2,
"record":-1,
"byte":-1,
"batchSize":2048
}
}
}
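}
# When you raise channel, edit the datax.py file of the DataX tool to prevent OOM.
# As shown below, raise -Xms and -Xmx to match the task machine's actual resources to prevent OOM.
# A larger channel value is not always better; a value that is too large hurts the host machine's performance.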
DEFAULT_JVM = "-Xms1g -Xmx1g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=%s/log" % (DATAX_HOME)

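JVM tuning

python datax.py --jvm="-Xms3G -Xmx3G" ../job/test.json

Tune this to the server's configuration. Do not set it too large, or the job fails immediately with an Exception.

The tuning above can be applied separately to each JSON job file.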
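Since the heap size can differ per job file, a small wrapper can assemble the command line shown above. This is a minimal sketch: build_datax_command and heap_gb are hypothetical names, and only the --jvm option that appears in the text is used.

```python
import shlex


def build_datax_command(job_path, heap_gb=None, datax_py="datax.py"):
    """Return the argv list for one DataX run, optionally overriding the JVM heap.

    heap_gb sets -Xms and -Xmx to the same value, as in the example above.
    """
    cmd = ["python", datax_py]
    if heap_gb is not None:
        if heap_gb <= 0:
            raise ValueError("heap_gb must be positive")
        # One argv element, so no shell quoting is needed when run via subprocess.
        cmd.append(f"--jvm=-Xms{heap_gb}G -Xmx{heap_gb}G")
    cmd.append(job_path)
    return cmd


if __name__ == "__main__":
    # Print the shell-quoted form for inspection before running anything.
    print(shlex.join(build_datax_command("../job/test.json", heap_gb=3)))
```

Passing the returned list to subprocess.run(cmd, check=True) would start the job; keeping heap_gb well below the task machine's physical memory follows the warning above.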

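3. Functional testing and performance testing

Quick start: https://github.com/alibaba/DataX/blob/master/userGuid.md

3.1 Passing parameters dynamically

If there are many tables to import and they all share the same format, the JSON file can be reused. A simple example: python datax.py -p "-Dsdbname=test -Dstable=test" ../job/test.json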

"column": ["*"],
"connection": [
{
"jdbcUrl": "jdbc:mysql://xxx:xx/${sdbname}?characterEncoding=utf-8",
"table": ["${stable}"]
}
],

The example above can be combined with shell scripts on Linux.
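The same reuse can be scripted from Python instead of shell. The sketch below only builds the command lines; the table list and the helper name build_param_command are hypothetical, while the -p and -D syntax is taken from the example above.

```python
import shlex


def build_param_command(job_path, params, datax_py="datax.py"):
    """Build the argv list for a DataX run with -D style job parameters."""
    # All -Dkey=value pairs travel inside a single -p argument.
    p_value = " ".join(f"-D{key}={value}" for key, value in params.items())
    return ["python", datax_py, "-p", p_value, job_path]


if __name__ == "__main__":
    # Hypothetical list of (database, table) pairs that share one job template.
    targets = [("test", "tt_user"), ("test", "ttt_user")]
    for db, table in targets:
        cmd = build_param_command("../job/test.json", {"sdbname": db, "stable": table})
        print(shlex.join(cmd))
```

Each printed line is one invocation; running them one after another (for example with subprocess.run) gives the same effect as a shell loop.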

3.2 mysql -> hdfs

Example 1: full import

# 1. View the configuration template
python datax.py -r mysqlreader -w hdfswriter
# 2. Create and edit the configuration file
vim custom/mysql2hdfs.json
{
"job":{
"setting":{
"speed":{
"channel":1
}
},
"content":[
{
"reader":{
"name":"mysqlreader",
"parameter":{
"username":"xxx",
"password":"xxx",
"column":["id","name","age","birthday"],
"connection":[
{
"table":[
"tt_user"
],
"jdbcUrl":[
"jdbc:mysql://192.168.1.96:3306/test"
]
}
]
}
},
"writer":{
"name":"hdfswriter",
"parameter":{
"defaultFS":"hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop":
"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"fileType":"text",
"path":"/tmp/test01",
"fileName":"tt_user",
"column":[
{"name":"id", "type":"INT"},
{"name":"name", "type":"VARCHAR"},
{"name":"age", "type":"INT"},
{"name":"birthday", "type":"date"}
],
"writeMode":"append",
"fieldDelimiter":"\t",
"compress":"GZIP"
}
}
}
]
}
}
# 3. Start the import process
python datax.py custom/mysql2hdfs.json
# 4. Log output
2018-11-23 14:37:58.056 [job-0] INFO JobContainer -
任务启动时刻 : 2018-11-23 14:37:45
任务结束时刻 : 2018-11-23 14:37:58
任务总计耗时 : 12s
任务平均流量 : 9B/s
记录写入速度 : 0rec/s
读出记录总数 : 7
读写失败总数 : 0
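When jobs run unattended, the summary block at the end of the log can be checked by a script. The following is a minimal sketch that assumes the summary keeps the "key : value" layout shown above; parse_summary and has_failures are hypothetical helpers, not part of DataX.

```python
SUMMARY_KEYS = {
    "任务启动时刻", "任务结束时刻", "任务总计耗时", "任务平均流量",
    "记录写入速度", "读出记录总数", "读写失败总数",
}


def parse_summary(log_text):
    """Collect the summary fields from DataX log text into a dict of strings."""
    result = {}
    for line in log_text.splitlines():
        # Split on the first colon only, so timestamps in the value stay intact.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in SUMMARY_KEYS:
            result[key] = value.strip()
    return result


def has_failures(summary):
    """True when the failed read/write counter is present and non-zero."""
    return int(summary.get("读写失败总数", "0")) > 0


if __name__ == "__main__":
    sample = "任务总计耗时 : 12s\n读出记录总数 : 7\n读写失败总数 : 0\n"
    summary = parse_summary(sample)
    print(summary, has_failures(summary))
```

A scheduler wrapper could fail the task when has_failures returns True or when the record count is unexpectedly zero.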

Example 2: incremental import (table splitting)

{
"job": {
"setting": {
"speed": {
"channel": 2
}
},
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "admin",
"password": "qweasd123",
"column": [
"id",
"name",
"age",
"birthday"
],
"splitPk": "id",
"where": "id<10",
"connection": [{
"table": [
"tt_user",
"ttt_user"
],
"jdbcUrl": [
"jdbc:mysql://hadoop01:3306/test"
]
}]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"fileType": "text",
"path": "/tmp/test/user",
"fileName": "mysql_test_user",
"column": [{
"name": "id",
"type": "INT"
},
{
"name": "name",
"type": "VARCHAR"
},
{
"name": "age",
"type": "INT"
},
{
"name": "birthday",
"type": "date"
}
],
"writeMode": "append",
"fieldDelimiter": "\t"
}
}
}]
}
}
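In an HA hadoopConfig block, the nameservice name from dfs.nameservices has to reappear in the dfs.ha.namenodes.*, dfs.namenode.rpc-address.* and dfs.client.failover.proxy.provider.* keys, and a mismatch is easy to miss when copying configurations between clusters. The sketch below checks this before a job is submitted. check_ha_config is a hypothetical helper, and it assumes a single nameservice.

```python
def check_ha_config(hadoop_config):
    """Return a list of problems found in an HDFS HA client config dict.

    Assumes dfs.nameservices names exactly one nameservice.
    """
    nameservice = hadoop_config.get("dfs.nameservices")
    if not nameservice:
        return ["dfs.nameservices is missing"]

    problems = []
    nn_key = f"dfs.ha.namenodes.{nameservice}"
    namenodes = hadoop_config.get(nn_key, "")
    if not namenodes:
        problems.append(f"{nn_key} is missing")

    # Every listed NameNode id needs its own rpc-address entry.
    for nn in (n.strip() for n in namenodes.split(",")):
        if nn:
            rpc_key = f"dfs.namenode.rpc-address.{nameservice}.{nn}"
            if rpc_key not in hadoop_config:
                problems.append(f"{rpc_key} is missing")

    provider_key = f"dfs.client.failover.proxy.provider.{nameservice}"
    if provider_key not in hadoop_config:
        problems.append(f"{provider_key} is missing")
    return problems


if __name__ == "__main__":
    # Hypothetical broken config: the namenode list is keyed by another cluster name.
    broken = {
        "dfs.nameservices": "flashHadoop",
        "dfs.ha.namenodes.otherCluster": "nn1,nn2",
    }
    for problem in check_ha_config(broken):
        print(problem)
```

An empty list means the keys are at least mutually consistent; it does not prove that the hosts are reachable.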

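Note: machines outside the cluster's network must communicate over the external IP, and access fails if hostname-based access is not configured.

This can be solved by configuring hdfs-site.xml: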

<property>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
<description>only config in clients</description>
</property>

Or by configuring the Java client:

Configuration conf=new Configuration();
conf.set("dfs.client.use.datanode.hostname", "true");

Or through the DataX job configuration:

"hadoopConfig": {
"dfs.client.use.datanode.hostname":"true",
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS00018:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS00019:8020",
"dfs.client.failover.proxy.provider.flashHadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
}

This corresponds to the following in the source code:

hadoopConf = new org.apache.hadoop.conf.Configuration();
Configuration hadoopSiteParams = taskConfig.getConfiguration(Key.HADOOP_CONFIG);
JSONObject hadoopSiteParamsAsJsonObject = JSON.parseObject(taskConfig.getString(Key.HADOOP_CONFIG));
if (null != hadoopSiteParams) {
Set<String> paramKeys = hadoopSiteParams.getKeys();
for (String each : paramKeys) {
hadoopConf.set(each, hadoopSiteParamsAsJsonObject.getString(each));
}
}
hadoopConf.set(HDFS_DEFAULTFS_KEY, taskConfig.getString(Key.DEFAULT_FS));
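To make the merge order in the Java snippet easier to follow, here is the same logic modelled with plain Python dicts. It is an illustration only: build_hadoop_conf is a hypothetical name, and the sketch assumes HDFS_DEFAULTFS_KEY resolves to "fs.defaultFS", which the snippet itself does not show.

```python
# Assumption: the constant HDFS_DEFAULTFS_KEY in the Java code holds this value.
HDFS_DEFAULTFS_KEY = "fs.defaultFS"


def build_hadoop_conf(task_config):
    """Mimic how the plugin fills its Hadoop Configuration from the job JSON."""
    hadoop_conf = {}
    site_params = task_config.get("hadoopConfig")
    if site_params is not None:
        # Every key under hadoopConfig is copied through unchanged.
        for key, value in site_params.items():
            hadoop_conf[key] = str(value)
    # defaultFS is written last, so it wins over any same-named hadoopConfig key.
    hadoop_conf[HDFS_DEFAULTFS_KEY] = task_config["defaultFS"]
    return hadoop_conf


if __name__ == "__main__":
    writer_parameter = {
        "defaultFS": "hdfs://flashHadoop",
        "hadoopConfig": {"dfs.client.use.datanode.hostname": "true"},
    }
    print(build_hadoop_conf(writer_parameter))
```

The takeaway is that any client-side HDFS property, including dfs.client.use.datanode.hostname, can be passed through hadoopConfig without touching hdfs-site.xml on the task machine.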

Example 3: incremental import (SQL query)

mysql2hdfs-condition.json

{
"job": {
"setting": {
"speed": {
"channel":1
}
},
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "xxx",
"password": "xxx",
"connection": [
{
"querySql": [
"select id,name,age,birthday from tt_user where id <= 5"
],
"jdbcUrl": [
"jdbc:mysql://192.168.1.96:3306/test"
]
}
]
}
},
"writer": {
"name": "hdfswriter",
"parameter":{
"defaultFS": "hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"fileType":"text",
"path":"/tmp/test01",
"fileName":"tt_user",
"column":[
{"name":"id", "type":"INT"},
{"name":"name", "type":"VARCHAR"},
{"name":"age", "type":"INT"},
{"name":"birthday", "type":"date"}
],
"writeMode":"append",
"fieldDelimiter":"\t"
}
}
}
]
}
}
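For a recurring incremental load, the querySql string is usually generated per run and not edited by hand. The sketch below shows one possible way, using a "last imported id" watermark. That approach and the helper name build_query_sql are assumptions for illustration; the job above uses a fixed condition.

```python
import json
import re

# Only plain identifiers are accepted, to keep the generated SQL safe.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_query_sql(table, columns, key, last_value):
    """Build a SELECT that returns only rows whose key is above last_value."""
    for name in [table, key, *columns]:
        if not IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if not isinstance(last_value, int):
        raise TypeError("last_value must be an int")
    return f"select {','.join(columns)} from {table} where {key} > {last_value}"


if __name__ == "__main__":
    sql = build_query_sql("tt_user", ["id", "name", "age", "birthday"], "id", 5)
    # The connection fragment that would replace the one in the job file.
    connection = {"querySql": [sql], "jdbcUrl": ["jdbc:mysql://192.168.1.96:3306/test"]}
    print(json.dumps(connection, indent=2))
```

The watermark itself has to be stored somewhere between runs, for example in a small state file or table, which is outside the scope of this sketch.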

hdfs -> mysql

# 1. View the configuration template
python datax.py -r hdfsreader -w mysqlwriter
# 2. Create and edit the configuration file
vim custom/hdfs2mysql.json
{
"job": {
"setting": {
"speed": {
"channel": 1
}
},
"content": [{
"reader": {
"name": "hdfsreader",
"parameter": {
"column": [{
"index": "0",
"type": "long"
},
{
"index": "1",
"type": "string"
},
{
"index": "2",
"type": "long"
},
{
"index": "3",
"type": "date"
}
],
"defaultFS": "hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"encoding": "UTF-8",
"fileType": "text",
"path": "/tmp/test/tt_user*",
"fieldDelimiter": "\t"
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"column": [
"id",
"name",
"age",
"birthday"
],
"connection": [{
"jdbcUrl": "jdbc:mysql://192.168.1.96:3306/test",
"table": ["ttt_user"]
}],
"username": "zhangqingli",
"password": "xxx",
"preSql": [
"select * from ttt_user",
"select name from ttt_user"
],
"session": [
"set session sql_mode='ANSI'"
],
"writeMode": "insert"
}
}
}]
}
}
# 3. Start the import process
python datax.py custom/hdfs2mysql.json
# 4. Log output
任务启动时刻 : 2018-11-23 14:44:54
任务结束时刻 : 2018-11-23 14:45:06
任务总计耗时 : 12s
任务平均流量 : 9B/s
记录写入速度 : 0rec/s
读出记录总数 : 7
读写失败总数 : 0

mongo -> hdfs

Example 1: full import

{
"job": {
"setting": {
"speed": {
"channel": 1
}
},
"content": [{
"reader": {
"name": "mongodbreader",
"parameter": {
"address": ["192.168.1.96:27017"],
"userName": "xxxx",
"userPassword": "xxxx",
"dbName": "test",
"collectionName": "student",
"column": [
{"name": "_id", "type": "string"},
{"name": "name", "type": "string"},
{"name": "age", "type": "double"},
{"name": "clazz", "type": "double"},
{"name": "hobbies", "type": "Array"},
{"name": "ss", "type": "Array"}
],
"splitter": ","
}
},
"writer": {
"name": "hdfswriter",
"parameter":{
"defaultFS": "hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"fileType":"text",
"path":"/tmp/test01",
"fileName":"mongo_student",
"column":[
{"name": "_id", "type": "string"},
{"name": "name", "type": "string"},
{"name": "age", "type": "double"},
{"name": "clazz", "type": "double"},
{"name": "hobbies", "type": "string"},
{"name": "ss", "type": "string"}
],
"writeMode":"append",
"fieldDelimiter":"\u0001"
}
}
}]
}
}
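In this job the MongoDB columns of type Array are written to HDFS as string columns, with "splitter" as the joining character. The sketch below illustrates that flattening conceptually. It is not DataX's code, and to_row, its handling of missing fields, and the sample document are assumptions.

```python
def flatten_array(values, splitter=","):
    """Join array elements into one string, as the splitter setting implies."""
    return splitter.join(str(value) for value in values)


def to_row(doc, columns, splitter=",", field_delimiter="\u0001"):
    """Render one document as a delimited text line, in column order."""
    fields = []
    for column in columns:
        value = doc.get(column["name"])
        if value is None:
            # Assumption: a missing field becomes an empty string.
            fields.append("")
        elif column["type"].lower() == "array":
            fields.append(flatten_array(value, splitter))
        else:
            fields.append(str(value))
    return field_delimiter.join(fields)


if __name__ == "__main__":
    columns = [
        {"name": "_id", "type": "string"},
        {"name": "name", "type": "string"},
        {"name": "hobbies", "type": "Array"},
    ]
    doc = {"_id": "a1", "name": "Tom", "hobbies": ["read", "run"]}
    print(repr(to_row(doc, columns)))
```

This also shows why fieldDelimiter and splitter should differ: if both were ",", array elements could not be told apart from columns in the output file.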

Example 2: mongo incremental import

{
"job": {
"setting": {
"speed": {
"channel": 2
}
},
"content": [{
"reader": {
"name": "mongodbreader",
"parameter": {
"address": ["<address>"],
"userName": "<username>",
"userPassword": "<password>",
"dbName": "<database name>",
"collectionName": "<collection name>",
"query":"{created:{ $gte: ISODate('1990-01-01T16:00:00.000Z'), $lte: ISODate('2010-01-01T16:00:00.000Z') }}",
"column": [
{ "name": "_id", "type": "string" },
{ "name": "owner", "type": "string" },
{ "name": "contributor", "type": "string" },
{ "name": "type", "type": "string" },
{ "name": "amount", "type": "int" },
{ "name": "divided", "type": "double" },
{ "name": "orderId", "type": "string" },
{ "name": "orderPrice", "type": "int" },
{ "name": "created", "type": "date" },
{ "name": "updated", "type": "date" },
{ "name": "hobbies", "type": "Array"}
]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"fileType": "text",
"path": "/user/hive/warehouse/aries.db/ods_goldsystem_mdaccountitems/accounting_day=$dt",
"fileName": "filenamexxx",
"column": [
{ "name": "_id", "type": "string" },
{ "name": "owner", "type": "string" },
{ "name": "contributor", "type": "string" },
{ "name": "type", "type": "string" },
{ "name": "amount", "type": "int" },
{ "name": "divided", "type": "double" },
{ "name": "orderId", "type": "string" },
{ "name": "orderPrice", "type": "int" },
{ "name": "created", "type": "date" },
{ "name": "updated", "type": "date" },
{ "name": "hobbies", "type": "string"}
],
"writeMode": "append",
"fieldDelimiter": "\t"
}
}
}]
}
}
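The 16:00:00.000Z boundaries in the query above correspond to midnight in UTC+8. For a daily incremental load, the query string can be generated from the business date. The following is a sketch under two assumptions: days are cut in UTC+8, and the upper bound uses $lt (a half-open window, so a record exactly on the boundary is not loaded twice) instead of the $lte in the example. day_window_query is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Assumption: business days are defined in UTC+8.
LOCAL_TZ = timezone(timedelta(hours=8))
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S.000Z"


def day_window_query(day, field="created"):
    """Build the mongodbreader query string covering one local calendar day."""
    start_local = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=LOCAL_TZ)
    end_local = start_local + timedelta(days=1)
    # MongoDB stores dates in UTC, so convert both ends before formatting.
    start_utc = start_local.astimezone(timezone.utc).strftime(ISO_FORMAT)
    end_utc = end_local.astimezone(timezone.utc).strftime(ISO_FORMAT)
    return "{%s:{ $gte: ISODate('%s'), $lt: ISODate('%s') }}" % (field, start_utc, end_utc)


if __name__ == "__main__":
    print(day_window_query("2010-01-02"))
```

The same date string could then be passed to the job as the dt parameter used in the output path, following the -p mechanism from section 3.1.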

hdfs -> mongo

{
"job": {
"setting": {
"speed": {
"channel": 2
}
},
"content": [{
"reader": {
"name": "hdfsreader",
"parameter": {
"column": [
{ "index": 0, "type": "String" },
{ "index": 1, "type": "String" },
{ "index": 2, "type": "Long" },
{ "index": 3, "type": "Date" }
],
"defaultFS": "hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"encoding": "UTF-8",
"fieldDelimiter": "\t",
"fileType": "text",
"path": "/tmp/test/mongo_student*"
}
},
"writer": {
"name": "mongodbwriter",
"parameter": {
"address": [
"192.168.1.96:27017"
],
"userName": "test",
"userPassword": "xxx",
"dbName": "test",
"collectionName": "student_from_hdfs",
"column": [
{ "name": "_id", "type": "string" },
{ "name": "name", "type": "string" },
{ "name": "age", "type": "int" },
{ "name": "birthday", "type": "date" }
],
"splitter": ",",
"upsertInfo": {
"isUpsert": "true",
"upsertKey": "_id"
}
}
}
}]
}
}
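The upsertInfo block makes the write idempotent with respect to upsertKey: a row whose key already exists replaces the stored document instead of creating a second one. The sketch below models that behaviour with a dict standing in for the collection. It is a conceptual illustration with hypothetical names, not the writer's implementation.

```python
def apply_upsert(collection, rows, upsert_key="_id"):
    """Insert rows into collection, replacing any document with the same key.

    collection is a dict keyed by the upsert key, standing in for MongoDB.
    """
    for row in rows:
        # Same key: the old document is replaced. New key: a document is added.
        collection[row[upsert_key]] = dict(row)
    return collection


if __name__ == "__main__":
    rows = [
        {"_id": "a1", "name": "Tom", "age": 20},
        {"_id": "a2", "name": "Ann", "age": 22},
    ]
    store = {}
    apply_upsert(store, rows)
    apply_upsert(store, rows)  # re-running the job does not duplicate documents
    print(len(store))
```

Without upsert, re-running the same job would either fail on duplicate _id values or, with generated ids, load the data twice.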
