DataX Tuning

Tags (space-separated): ETL


I. DataX tuning directions

DataX tuning can be split into several parts (note: "task machine" below means the machine on which the DataX job runs).

1. The impact of hardware factors such as the network's own bandwidth;

2. DataX's own parameters;

3. The path from the source to the task machine;

4. The path from the task machine to the destination.

In other words, when DataX transfers feel slow, start troubleshooting from the four aspects above.

1. Tuning around network bandwidth and other hardware factors

This part is mainly about understanding the network itself: how much bandwidth there is from source to destination (and how the actual bandwidth is calculated), its usual utilization, and how busy it is, in order to judge whether this is what is slowing things down. A few ideas:

1. Use scp, a Python HTTP server, nethogs, etc. between source and destination to observe the actual network and NIC speed (see the sketch after this list);

2. Check monitoring for how busy the network is overall during the job's time window, to decide whether the job should be moved away from network peak hours;

3. Watch the load on the task machine, especially network and disk I/O, to see whether it has become the bottleneck and is limiting the speed.
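
As a rough illustration of point 1, the following is a minimal sketch, assuming the task machine can SSH to a target host and that the NIC is eth0 (host names, paths and the interface name are placeholders):

# Create a 1 GB test file and time the copy; the wall-clock time gives the effective MB/s.
dd if=/dev/zero of=/tmp/net_test.bin bs=1M count=1024
time scp /tmp/net_test.bin user@target-host:/tmp/

# In another terminal, watch per-process NIC traffic while the copy runs (needs root).
sudo nethogs eth0

# A quick HTTP alternative: serve the file on the source and pull it from the other end.
python -m http.server 8000                                # run on the source machine
curl -o /dev/null http://source-host:8000/net_test.bin    # run on the target; curl prints the average speed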

2. Tuning DataX's own parameters

Global

{
    "core": {
        "transport": {
            "channel": {
                "speed": {
                    "channel": 2,      ## concurrency of the import (number of channels); tune to the server hardware
                    "record": -1,      ## lift the limit on the number of records
                    "byte": -1,        ## lift the limit on bytes
                    "batchSize": 2048  ## size of each batch that is read
                }
            }
        }
    },
    "job": {
        ...
    }
}

Per-job

"setting": {
"speed": {
"channel": 2,
"record":-1,
"byte":-1,
"batchSize":2048
}
}
}
} # channel增大,为防止OOM,需要修改datax工具的datax.py文件。
# 如下所示,可根据任务机的实际配置,提升-Xms与-Xmx,来防止OOM。
# tunnel并不是越大越好,过分大反而会影响宿主机的性能。
DEFAULT_JVM = "-Xms1g -Xmx1g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=%s/log" % (DATAX_HOME)

JVM tuning

python datax.py  --jvm="-Xms3G -Xmx3G" ../job/test.json

Tune this according to the server configuration. Do not set it too large, or the job will fail immediately with an exception.

All of the tuning above can be applied per job JSON file.

3. Functional and performance testing

Quick start: https://github.com/alibaba/DataX/blob/master/userGuid.md

3.1 Dynamic parameters

If there are many tables to import and they share the same structure, the job JSON file can be reused. A simple example:

python datax.py -p "-Dsdbname=test -Dstable=test" ../job/test.json

"column": ["*"],
"connection": [
{
"jdbcUrl": "jdbc:mysql://xxx:xx/${sdbname}?characterEncoding=utf-8",
"table": ["${stable}"]
}
],

The example above can be nested with shell scripting on Linux; a sketch follows below.
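
A minimal sketch of that nesting, assuming several tables share the template above (the database and table names in the loop are made up):

# Reuse one job template for several tables of the same schema.
DB=test
for TBL in tt_user ttt_user order_log; do
    python datax.py -p "-Dsdbname=${DB} -Dstable=${TBL}" ../job/test.json
done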

3.2 mysql -> hdfs

Example 1: full load

# 1. View the config template
python datax.py -r mysqlreader -w hdfswriter

# 2. Create and edit the job file
vim custom/mysql2hdfs.json
{
"job":{
"setting":{
"speed":{
"channel":1
}
},
"content":[
{
"reader":{
"name":"mysqlreader",
"parameter":{
"username":"xxx",
"password":"xxx",
"column":["id","name","age","birthday"],
"connection":[
{
"table":[
"tt_user"
],
"jdbcUrl":[
"jdbc:mysql://192.168.1.96:3306/test"
]
}
]
}
},
"writer":{
"name":"hdfswriter",
"parameter":{
"defaultFS":"hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop":
"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"fileType":"text",
"path":"/tmp/test01",
"fileName":"tt_user",
"column":[
{"name":"id", "type":"INT"},
{"name":"name", "type":"VARCHAR"},
{"name":"age", "type":"INT"}
{"name":"birthday", "type":"date"}
],
"writeMode":"append",
"fieldDelimiter":"\t",
"compress":"GZIP"
}
}
}
]
}
}

# 3. Start the import job
python datax.py custom/mysql2hdfs.json

# 4. Log output
2018-11-23 14:37:58.056 [job-0] INFO JobContainer -
Job start time            : 2018-11-23 14:37:45
Job end time              : 2018-11-23 14:37:58
Total elapsed time        : 12s
Average traffic           : 9B/s
Record write speed        : 0rec/s
Total records read        : 7
Total read/write failures : 0
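
To double-check the result, the target path can be listed and sampled; a sketch assuming the path and GZIP compression from the job above:

hdfs dfs -ls /tmp/test01                     # hdfswriter writes fileName plus a generated suffix
hdfs dfs -text /tmp/test01/tt_user* | head   # -text decompresses the gzip output for a quick sample
hdfs dfs -text /tmp/test01/tt_user* | wc -l  # should match the "Total records read" figure above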

Example 2: incremental load (table splitting)

{
"job": {
"setting": {
"speed": {
"channel": 2
}
},
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "admin",
"password": "qweasd123",
"column": [
"id",
"name",
"age",
"birthday"
],
"splitPk": "id",
"where": "id<10",
"connection": [{
"table": [
"tt_user",
"ttt_user"
],
"jdbcUrl": [
"jdbc:mysql://hadoop01:3306/test"
]
}]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.minq-cluster": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"fileType": "text",
"path": "/tmp/test/user",
"fileName": "mysql_test_user",
"column": [{
"name": "id",
"type": "INT"
},
{
"name": "name",
"type": "VARCHAR"
},
{
"name": "age",
"type": "INT"
},
{
"name": "birthday",
"type": "date"
}
],
"writeMode": "append",
"fieldDelimiter": "\t"
}
}
}]
}
}
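
The where condition above is fixed at id<10; in practice the incremental window is usually passed in from the shell. A hedged sketch that combines this with the dynamic parameters from 3.1, assuming the where clause is rewritten as "create_time >= '${dt}'" and the job is saved as custom/mysql2hdfs_incr.json (both made-up names):

# Load yesterday's rows only; ${dt} is substituted through DataX's -p option.
dt=$(date -d "yesterday" +%F)
python datax.py -p "-Ddt=${dt}" custom/mysql2hdfs_incr.json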

Note: machines in an external network must communicate over externally reachable IPs; if the client is not configured to access DataNodes by hostname, access will fail.

This can be solved by configuring hdfs-site.xml:

<property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
    <description>only config on clients</description>
</property>

Or by configuring the Java client:

Configuration conf = new Configuration();
conf.set("dfs.client.use.datanode.hostname", "true");

Or by adding it to the DataX job configuration:

"hadoopConfig": {
"dfs.client.use.datanode.hostname":"true",
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.minq-cluster": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS00018:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS00019:8020",
"dfs.client.failover.proxy.provider.minq-cluster": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
}

This corresponds to the following passage in the DataX source:

// Copy every key under hadoopConfig into the Hadoop Configuration,
// then set the default filesystem from the job's defaultFS.
hadoopConf = new org.apache.hadoop.conf.Configuration();
Configuration hadoopSiteParams = taskConfig.getConfiguration(Key.HADOOP_CONFIG);
JSONObject hadoopSiteParamsAsJsonObject = JSON.parseObject(taskConfig.getString(Key.HADOOP_CONFIG));
if (null != hadoopSiteParams) {
    Set<String> paramKeys = hadoopSiteParams.getKeys();
    for (String each : paramKeys) {
        hadoopConf.set(each, hadoopSiteParamsAsJsonObject.getString(each));
    }
}
hadoopConf.set(HDFS_DEFAULTFS_KEY, taskConfig.getString(Key.DEFAULT_FS));

Example 3: incremental load (SQL query)

mysql2hdfs-condition.json

{
"job": {
"setting": {
"speed": {
"channel":1
}
},
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "xxx",
"password": "xxx",
"connection": [
{
"querySql": [
"select id,name,age,birthday from tt_user where id <= 5"
],
"jdbcUrl": [
"jdbc:mysql://192.168.1.96:3306/test"
]
}
]
}
},
"writer": {
"name": "hdfswriter",
"parameter":{
"defaultFS": "hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop":
"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
"fileType":"text",
"path":"/tmp/test01",
"fileName":"tt_user",
"column":[
{"name":"id", "type":"INT"},
{"name":"name", "type":"VARCHAR"},
{"name":"age", "type":"INT"}
{"name":"birthday", "type":"date"}
],
"writeMode":"append",
"fieldDelimiter":"\t"
}
}
}
]
}
}

3.3 hdfs -> mysql

# 1. View the config template
python datax.py -r hdfsreader -w mysqlwriter

# 2. Create and edit the job file
vim custom/hdfs2mysql.json
{
"job": {
"setting": {
"speed": {
"channel": 1
}
},
"content": [{
"reader": {
"name": "hdfsreader",
"parameter": {
"column": [{
"index": "0",
"type": "long"
},
{
"index": "1",
"type": "string"
},
{
"index": "2",
"type": "long"
},
{
"index": "3",
"type": "date"
}
],
"defaultFS": "hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"encoding": "UTF-8",
"fileType": "text",
"path": "/tmp/test/tt_user*",
"fieldDelimiter": "\t"
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"column": [
"id",
"name",
"age",
"birthday"
],
"connection": [{
"jdbcUrl": "jdbc:mysql://192.168.1.96:3306/test",
"table": ["ttt_user"]
}],
"username": "zhangqingli",
"password": "xxx",
"preSql": [
"select * from ttt_user",
"select name from ttt_user"
],
"session": [
"set session sql_mode='ANSI'"
],
"writeMode": "insert"
}
}
}]
}
}

# 3. Start the import job
python datax.py custom/hdfs2mysql.json

# 4. Log output
Job start time            : 2018-11-23 14:44:54
Job end time              : 2018-11-23 14:45:06
Total elapsed time        : 12s
Average traffic           : 9B/s
Record write speed        : 0rec/s
Total records read        : 7
Total read/write failures : 0
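
A quick sanity check on the MySQL side after the job finishes (host and credentials are the placeholders used in the job above):

mysql -h192.168.1.96 -uzhangqingli -p -e "SELECT COUNT(*) FROM test.ttt_user;"   # should equal the "Total records read" figure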

3.4 mongo -> hdfs

Example 1: full load

{
"job": {
"setting": {
"speed": {
"channel": 1
}
},
"content": [{
"reader": {
"name": "mongodbreader",
"parameter": {
"address": ["192.168.1.96:27017"],
"userName": "xxxx",
"userPassword": "xxxx",
"dbName": "test",
"collectionName": "student",
"column": [
{"name": "_id", "type": "string"},
{"name": "name", "type": "string"},
{"name": "age", "type": "double"},
{"name": "clazz", "type": "double"},
{"name": "hobbies", "type": "Array"},
{"name": "ss", "type": "Array"}
],
"splitter": ","
}
},
"writer": {
"name": "hdfswriter",
"parameter":{
"defaultFS": "hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"fileType":"text",
"path":"/tmp/test01",
"fileName":"mongo_student",
"column":[
{"name": "_id", "type": "string"},
{"name": "name", "type": "string"},
{"name": "age", "type": "double"},
{"name": "clazz", "type": "double"},
{"name": "hobbies", "type": "string"},
{"name": "ss", "type": "string"}
],
"writeMode":"append",
"fieldDelimiter":"\u0001"
}
}
}]
}
}

Example 2: incremental load from Mongo

{
"job": {
"setting": {
"speed": {
"channel": 2
}
},
"content": [{
"reader": {
"name": "mongodbreader",
"parameter": {
"address": ["地址"],
"userName": "用户名",
"userPassword": "密码",
"dbName": "库名",
"collectionName": "集合名",
"query":"{created:{ $gte: ISODate('1990-01-01T16:00:00.000Z'), $lte: ISODate('2010-01-01T16:00:00.000Z') }}",
"column": [
{ "name": "_id", "type": "string" },
{ "name": "owner", "type": "string" },
{ "name": "contributor", "type": "string" },
{ "name": "type", "type": "string" },
{ "name": "amount", "type": "int" },
{ "name": "divided", "type": "double" },
{ "name": "orderId", "type": "string" },
{ "name": "orderPrice", "type": "int" },
{ "name": "created", "type": "date" },
{ "name": "updated", "type": "date" },
{ "name": "hobbies", "type": "Array"}
]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"fileType": "text",
"path": "/user/hive/warehouse/aries.db/ods_goldsystem_mdaccountitems/accounting_day=$dt",
"fileName": "filenamexxx",
"column": [
{ "name": "_id", "type": "string" },
{ "name": "owner", "type": "string" },
{ "name": "contributor", "type": "string" },
{ "name": "type", "type": "string" },
{ "name": "amount", "type": "int" },
{ "name": "divided", "type": "double" },
{ "name": "orderId", "type": "string" },
{ "name": "orderPrice", "type": "int" },
{ "name": "created", "type": "date" },
{ "name": "updated", "type": "date" },
{ "name": "hobbies", "type": "string"}
],
"writeMode": "append",
"fieldDelimiter": "\t"
}
}
}]
}
}
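
The $dt in the target path (and normally the ISODate bounds in the query as well) is meant to be filled in before the job is submitted. A hedged way to do that from the shell, assuming the template is saved as custom/mongo2hdfs_incr.json (a made-up name):

# Substitute the partition date into a copy of the template, then run it.
dt=$(date -d "yesterday" +%Y-%m-%d)
sed "s|[$]dt|${dt}|g" custom/mongo2hdfs_incr.json > /tmp/mongo2hdfs_${dt}.json
python datax.py /tmp/mongo2hdfs_${dt}.json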

3.5 hdfs -> mongo

{
"job": {
"setting": {
"speed": {
"channel": 2
}
},
"content": [{
"reader": {
"name": "hdfsreader",
"parameter": {
"column": [
{ "index": 0, "type": "String" },
{ "index": 1, "type": "String" },
{ "index": 2, "type": "Long" },
{ "index": 3, "type": "Date" }
],
"defaultFS": "hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"encoding": "UTF-8",
"fieldDelimiter": "\t",
"fileType": "text",
"path": "/tmp/test/mongo_student*"
}
},
"writer": {
"name": "mongodbwriter",
"parameter": {
"address": [
"192.168.1.96:27017"
],
"userName": "test",
"userPassword": "xxx",
"dbName": "test",
"collectionName": "student_from_hdfs",
"column": [
{ "name": "_id", "type": "string" },
{ "name": "name", "type": "string" },
{ "name": "age", "type": "int" },
{ "name": "birthday", "type": "date" }
],
"splitter": ",",
"upsertInfo": {
"isUpsert": "true",
"upsertKey": "_id"
}
}
}
}]
}
}
