DataX Operation Guide
1. DataX Introduction
DataX
DataX is an offline data synchronization tool/platform widely used within Alibaba Group. It provides efficient data synchronization between all kinds of heterogeneous data sources, including MySQL, Oracle, SqlServer, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.
Features
As a data synchronization framework, DataX abstracts the synchronization of different data sources into Reader plugins, which read data from the source, and Writer plugins, which write data to the target; in principle, the framework can synchronize between any types of data source. The plugin system also forms an ecosystem: each newly integrated data source immediately becomes interoperable with every existing one.
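Every DataX job is described by a single JSON file with this same overall shape: a reader, a writer, and a speed setting. A minimal sketch of that structure (the plugin names and empty parameters here are placeholders, not a runnable job):
{
"job": {
"content": [{
"reader": { "name": "somereader", "parameter": {} },
"writer": { "name": "somewriter", "parameter": {} }
}],
"setting": { "speed": { "channel": 1 } }
}
}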
Installation
Download: DataX download page
After extracting the archive it is ready to use; run a job with the following script:
python27 datax.py ..\job\test.json
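On Linux the equivalent call would be (assuming a python command pointing at Python 2.7):
python datax.py ../job/test.json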
2. DataX Data Synchronization
2.1 MySQL to MySQL
Table creation DDL (varchar lengths assumed; the original omitted them):
DROP TABLE IF EXISTS `tb_dmp_requser`;
CREATE TABLE `tb_dmp_requser` (
`reqid` varchar(50) NOT NULL COMMENT 'Activity ID',
`exetype` varchar(50) DEFAULT NULL COMMENT 'Execution type',
`allnum` varchar(50) DEFAULT NULL COMMENT 'Total number of target users',
`exenum` varchar(50) DEFAULT NULL COMMENT 'Number of target users executed',
`resv` varchar(50) DEFAULT NULL,
`createtime` datetime DEFAULT NULL
)
Copy the tb_dmp_requser table in the dmp database to the tb_dmp_requser table in dota2_databank.
job_mysql_to_mysql.json is as follows (a channel count of 1 is assumed where the original left it blank):
{
"job": {
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"column": [
"allnum", "reqid"
],
"connection": [{
"jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/dmp"],
"table": ["tb_dmp_requser"]
}],
"password": "",
"username": "root"
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"column": [
"allnum", "reqid"
],
"preSql": [
"delete from tb_dmp_requser"
],
"connection": [{
"jdbcUrl": "jdbc:mysql://127.0.0.1:3306/dota2_databank",
"table": ["tb_dmp_requser"]
}],
"password": "",
"username": "root",
"writeMode": "replace"
}
}
}],
"setting": {
"speed": {
"channel": ""
}
}
}
}
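The job is then run the same way as the test job above (the job file path is assumed to be relative to DataX's bin directory):
python27 datax.py ../job/job_mysql_to_mysql.json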
2.2 Oracle to Oracle
Copy the test table under the scott user to the test table under the test user.
Table creation DDL (lengths assumed, as above):
drop table TEST;
CREATE TABLE TEST (
ID NUMBER NULL,
NAME VARCHAR2(50 BYTE) NULL
)
LOGGING
NOCOMPRESS
NOCACHE;
job_oracle_oracle.json
{
"job": {
"content": [
{
"reader": {
"name": "oraclereader",
"parameter": {
"column": ["id","name"],
"connection": [
{
"jdbcUrl": ["jdbc:oracle:thin:@localhost:1521:ORCL"],
"table": ["test"]
}
],
"password": "tiger",
"username": "scott",
"where":"rownum < 1000"
}
},
"writer": {
"name": "oraclewriter",
"parameter": {
"column": ["id","name"],
"connection": [
{
"jdbcUrl": "jdbc:oracle:thin:@localhost:1521:ORCL",
"table": ["test"]
}
],
"password": "test",
"username": "test"
}
}
}
],
"setting": {
"speed": {
"channel":
}
}
}
}
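Run it like the MySQL job (path assumed relative to the bin directory). Note the reader's where clause: only rows with rownum < 1000 are copied.
python27 datax.py ../job/job_oracle_oracle.json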
2.3 HBase to Local
Copy the HBase table "LXW" to the local path ../job/datax_hbase.
Create the table and insert two rows of data:
hbase(main):001:0> create 'LXW','CF'
0 row(s) in 1.2120 seconds
=> Hbase::Table - LXW
hbase(main):002:0> put 'LXW','row1','CF:NAME','lxw'
0 row(s) in 0.0120 seconds
hbase(main):003:0> put 'LXW','row1','CF:AGE',''
0 row(s) in 0.0080 seconds
hbase(main):004:0> put 'LXW','row1','CF:ADDRESS','BeijingYiZhuang'
0 row(s) in 0.0070 seconds
hbase(main):005:0> put 'LXW','row2','CF:ADDRESS','BeijingYiZhuang2'
0 row(s) in 0.0060 seconds
hbase(main):006:0> put 'LXW','row2','CF:AGE',''
0 row(s) in 0.0050 seconds
hbase(main):007:0> put 'LXW','row2','CF:NAME','lxw2'
0 row(s) in 0.0040 seconds
hbase(main):008:0> exit
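A scan can confirm the two rows before running the export:
hbase(main):009:0> scan 'LXW'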
job_hbase_to_local.json
For HBase HA cluster configuration, see https://www.cnblogs.com/Java-Starter/p/10756647.html
{
"job": {
"content": [
{
"reader": {
"name": "hbase11xreader",
"parameter": {
"hbaseConfig": {
"hbase.zookeeper.quorum": "CentOS7Five:2181,CentOS7Six:2181,CentOS7Seven:2181"
},
"table": "LXW",
"encoding": "utf-8",
"mode": "normal",
"column": [
{
"name":"rowkey",
"type":"string"
},
{
"name":"CF:NAME",
"type":"string"
},
{
"name":"CF:AGE",
"type":"string"
},
{
"name":"CF:ADDRESS",
"type":"string"
} ], "range": {
"endRowkey": "",
"isBinaryRowkey": false,
"startRowkey": ""
} }
},
"writer": {
"name": "txtfilewriter",
"parameter": {
"dateFormat": "yyyy-MM-dd",
"fieldDelimiter": "\t",
"fileName": "LXW",
"path": "../job/datax_hbase",
"writeMode": "truncate"
}
}
}
],
"setting": {
"speed": {
"channel":
}
}
}
}
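Run the job (path assumed relative to the bin directory):
python27 datax.py ../job/job_hbase_to_local.json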
The file LXW__e647d969_d2c6_47ad_9534_15c90d696099 is generated under ../job/datax_hbase.
Its content is as follows:
row1 lxw BeijingYiZhuang
row2 lxw2 BeijingYiZhuang2
2.4 Local to HBase
Import a local file into the HBase table LXW.
Source data source.txt (fields: rowkey, name, age, address; the age field is empty):
row3,jjj1,,BeijingYiZhuang3
row4,jjj2,,BeijingYiZhuang4
job_local_to_hbase.json
{
"job": {
"setting": {
"speed": {
"channel":
}
},
"content": [
{
"reader": {
"name": "txtfilereader",
"parameter": {
"path": "../job/datax_hbase/source.txt",
"charset": "UTF-8",
"column": [
{
"index": 0,
"type": "string"
},
{
"index": 1,
"type": "string"
},
{
"index": 2,
"type": "string"
},
{
"index": 3,
"type": "string"
}
],
"fieldDelimiter": ","
}
},
"writer": {
"name": "hbase11xwriter",
"parameter": {
"hbaseConfig": {
"hbase.zookeeper.quorum": "CentOS7Five:2181,CentOS7Six:2181,CentOS7Seven:2181"
},
"table": "LXW",
"mode": "normal",
"rowkeyColumn": [
{
"index":,
"type":"string"
}
],
"column": [
{
"index":,
"name":"CF:NAME",
"type":"string"
},
{
"index":,
"name":"CF:AGE",
"type":"string"
},
{
"index":,
"name":"CF:ADDRESS",
"type":"string"
}
],
"versionColumn":{
"index": -,
"value":""
},
"encoding": "utf-8"
}
}
}
]
}
}
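Run it the same way (path assumed):
python27 datax.py ../job/job_local_to_hbase.json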
After the import, the newly added rows can be seen:
hbase(main):001:0> get 'LXW','row3'
COLUMN                CELL
 CF:ADDRESS           timestamp=..., value=BeijingYiZhuang3
 CF:AGE               timestamp=..., value=
 CF:NAME              timestamp=..., value=jjj1
2.5 Local to HDFS/Hive
Importing from HDFS to local does not support HA, so that direction is not tested here.
For Hive HA configuration, see https://www.cnblogs.com/Java-Starter/p/10756528.html
Import a local data file into HDFS/Hive; the table must be created in Hive before the import can run.
Because of path handling, this can only be done on the Linux side.
Source data source.txt (four comma-separated fields per row):
,,,
,,,
Table creation DDL (varchar lengths assumed):
create table datax_test(
col1 varchar(20),
col2 varchar(20),
col3 varchar(20),
col4 varchar(20)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC;
fileType should be orc; the text type requires compression and may produce garbled characters.
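The hdfswriter path must match the table's warehouse directory; if in doubt, the actual location can be checked in Hive first:
hive> describe formatted datax_test;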
job_local_to_hdfs.json
{
"setting": {},
"job": {
"setting": {
"speed": {
"channel":
}
},
"content": [
{
"reader": {
"name": "txtfilereader",
"parameter": {
"path": ["../job/datax_hbase/source.txt"],
"encoding": "UTF-8",
"column": [
{
"index": 0,
"type": "string"
},
{
"index": 1,
"type": "string"
},
{
"index": 2,
"type": "string"
},
{
"index": 3,
"type": "string"
}
],
"fieldDelimiter": ","
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://ns1/",
"hadoopConfig":{
"dfs.nameservices": "ns1",
"dfs.ha.namenodes.ns1": "nn1,nn2",
"dfs.namenode.rpc-address.ns1.nn1": "CentOS7One:9000",
"dfs.namenode.rpc-address.ns1.nn2": "CentOS7Two:9000",
"dfs.client.failover.proxy.provider.ns1": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"fileType": "orc",
"path": "/user/hive/warehouse/datax_test",
"fileName": "datax_test",
"column": [
{
"name": "col1",
"type": "VARCHAR"
},
{
"name": "col2",
"type": "VARCHAR"
},
{
"name": "col3",
"type": "VARCHAR"
},
{
"name": "col4",
"type": "VARCHAR"
}
],
"writeMode": "append",
"fieldDelimiter": ",",
"compress":"NONE"
}
}
}
]
}
}
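Since this job runs on the Linux side, start it with (path assumed relative to the bin directory):
python datax.py ../job/job_local_to_hdfs.json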
After the import is done, check Hive:
hive> select * from datax_test;
OK
Time taken: 0.085 seconds, Fetched: 2 row(s)
2.6 txt to Oracle
txt, dat, csv, and similar formats all work; this .dat file is 16 GB and contains 180 million records.
Table creation DDL (VARCHAR2 lengths assumed; the original omitted them):
CREATE TABLE T_CJYX_HOMECOUNT (
"ACYC_ID" VARCHAR2( BYTE) NULL ,
"ADDRESS_ID" VARCHAR2( BYTE) NULL ,
"ADDRESS_NAME" VARCHAR2( BYTE) NULL ,
"ADDRESS_LEVEL" VARCHAR2( BYTE) NULL ,
"CHECK_TARGET_NUM" VARCHAR2( BYTE) NULL ,
"CHECK_VALUE" VARCHAR2( BYTE) NULL ,
"TARGET_PHONE" VARCHAR2( BYTE) NULL ,
"NOTARGET_PHONE" VARCHAR2( BYTE) NULL ,
"PARENT_ID" VARCHAR2( BYTE) NULL ,
"BCYC_ID" VARCHAR2( BYTE) NULL
)
The job_txt_to_oracle.json file is as follows:
{
"setting": {},
"job": {
"setting": {
"speed": {
"channel":
}
},
"content": [
{
"reader": {
"name": "txtfilereader",
"parameter": {
"path": ["E:/opt/srcbigdata2/di_00121_20190427.dat"],
"encoding": "UTF-8",
"nullFormat": "",
"column": [
{
"index": 0,
"type": "string"
},
{
"index": 1,
"type": "string"
},
{
"index": 2,
"type": "string"
},
{
"index": 3,
"type": "string"
},
{
"index": 4,
"type": "string"
},
{
"index": 5,
"type": "string"
},
{
"index": 6,
"type": "string"
},
{
"index": 7,
"type": "string"
},
{
"index": 8,
"type": "string"
},
{
"index": 9,
"type": "string"
}
],
"fieldDelimiter": "$"
}
},
"writer": {
"name": "oraclewriter",
"parameter": {
"column": ["acyc_id","address_id","address_name","address_level","check_target_num","check_value","target_phone","notarget_phone","parent_id","bcyc_id"],
"connection": [
{
"jdbcUrl": "jdbc:oracle:thin:@localhost:1521:ORCL",
"table": ["T_CJYX_HOMECOUNT"]
}
],
"password": "test",
"username": "test"
}
}
}
]
}
}
Script:
python27 datax.py ../job/job_txt_to_oracle.json
It is much more efficient than Oracle's own sqlldr: importing the 180 million records took only 117 minutes, while sqlldr needed 41 hours.
2.7 txt to txt
job_txt_to_txt.json is as follows:
{
"setting": {},
"job": {
"setting": {
"speed": {
"channel":
}
},
"content": [
{
"reader": {
"name": "txtfilereader",
"parameter": {
"path": ["../job/data_txt/a.txt"],
"encoding": "UTF-8",
"column": [
{
"index": 0,
"type": "string"
},
{
"index": 1,
"type": "string"
}
],
"fieldDelimiter": "$"
}
},
"writer": {
"name": "txtfilewriter",
"parameter": {
"path": "../job/data_txt/",
"fileName": "luohw",
"writeMode": "truncate",
"format": "yyyy-MM-dd"
}
}
}
]
}
}
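Run it as usual (path assumed):
python27 datax.py ../job/job_txt_to_txt.json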
After the import completes, the output file (named luohw plus a random suffix, as in section 2.3) is generated under ../job/data_txt.