关系数据库数据与hadoop数据进行转换的工具 - Sqoop
Sqoop
本文所使用的Sqoop版本为1.4.6
1.官网
2.作用
A:可以把hadoop数据导入到关系数据库里面(e.g. Hive -> Mysql)
B:可以把关系数据库数据导入到hadoop里面(e.g. Mysql -> Hive)
3.下载
http://archive.apache.org/dist/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-1.0.0.tar.gz
4.安装
--上传到node1(我之前安装的hive就在node1上面)本目录,并且解压
cd
tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz --创建软链
ln -sf /root/sqoop-1.4.6.bin__hadoop-2.0.4-alpha /home/sqoop-1.4.6
5.环境变量配置
--配置环境变量
vi /etc/profile export HADOOP_PREFIX=$HADOOP_HOME
export PATH=$PATH:$SQOOP_HOME/bin :wq source /etc/profile
6.修改配置文件
--修改配置文件
cd /home/sqoop-1.4.6/conf/ cp sqoop-env-template.sh sqoop-env.sh vi sqoop-env.sh
7.添加驱动包
--把mysql驱动包添加到sqoop的lib目录下面
cd
scp mysql-connector-java-5.1.23-bin.jar /home/sqoop-1.4.6/lib/
8.测试
sqoop version [root@node1 ~]# sqoop version
Warning: /home/sqoop-1.4.6/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /home/sqoop-1.4.6/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/sqoop-1.4.6/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /home/sqoop-1.4.6/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
19/01/04 23:15:15 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Sqoop 1.4.6
git commit id c0c5a81723759fa575844a0a1eae8f510fa32c25
Compiled by root on Mon Apr 27 14:38:36 CST 2015 sqoop list-databases -connect jdbc:mysql://node1:3306/ -username root -password '!QAZ2wsx3edc' [root@node1 ~]# sqoop list-databases -connect jdbc:mysql://node1:3306/ -username root -password '!QAZ2wsx3edc'
Warning: /home/sqoop-1.4.6/../hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /home/sqoop-1.4.6/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /home/sqoop-1.4.6/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /home/sqoop-1.4.6/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
19/01/04 23:17:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
19/01/04 23:17:49 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/01/04 23:17:49 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
hive
mysql
performance_schema
result_db
spark
sys
你会发现,在输出里面会有很多Warning
我们可以通过以下操作去掉这些Warning
--去除Warning
cd /home/sqoop-1.4.6/bin/ vi configure-sqoop --把下面的行全部注释掉 - 在每一行前面加 '#'
## Moved to be a runtime check in sqoop.
#if [ ! -d "${HBASE_HOME}" ]; then
# echo "Warning: $HBASE_HOME does not exist! HBase imports will fail."
# echo 'Please set $HBASE_HOME to the root of your HBase installation.'
#fi ## Moved to be a runtime check in sqoop.
#if [ ! -d "${HCAT_HOME}" ]; then
# echo "Warning: $HCAT_HOME does not exist! HCatalog jobs will fail."
# echo 'Please set $HCAT_HOME to the root of your HCatalog installation.'
#fi #if [ ! -d "${ACCUMULO_HOME}" ]; then
# echo "Warning: $ACCUMULO_HOME does not exist! Accumulo imports will fail."
# echo 'Please set $ACCUMULO_HOME to the root of your Accumulo installation.'
#fi
#if [ ! -d "${ZOOKEEPER_HOME}" ]; then
# echo "Warning: $ZOOKEEPER_HOME does not exist! Accumulo imports will fail."
# echo 'Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.'
#fi :wq --再次测试
[root@node1 bin]# sqoop list-databases -connect jdbc:mysql://node1:3306/ -username root -password '!QAZ2wsx3edc'
19/01/04 23:34:21 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
19/01/04 23:34:21 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/01/04 23:34:21 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
hive
mysql
performance_schema
result_db
spark
sys
9.应用
9.1.从Mysql导入到HDFS
准备工作:
--在数据库里面先创建table
CREATE TABLE t_user (id INT, name VARCHAR(20), age INT); --插入测试数据
insert into t_user values(1, 'Tom', 20);
insert into t_user values(2, 'John', 18);
insert into t_user values(3, 'Div', 25);
insert into t_user values(4, 'Susan', 31);
insert into t_user values(5, 'Tiran', 40);
insert into t_user values(6, 'Shasita', 13);
查询结果:
mysql> select * from t_user;
+------+---------+------+
| id | name | age |
+------+---------+------+
| 1 | Tom | 20 |
| 2 | John | 18 |
| 3 | Div | 25 |
| 4 | Susan | 31 |
| 5 | Tiran | 40 |
| 6 | Shasita | 13 |
+------+---------+------+
6 rows in set (0.00 sec)
--从mysql数据库里面导出数据到Hdfs上面
sqoop import --connect jdbc:mysql://node1:3306/sqoop_db --username root --password '!QAZ2wsx3edc' --table t_user --columns id,name,age -m 1 --target-dir /sqoop_t_user [root@node1 bin]# sqoop import --connect jdbc:mysql://node1:3306/sqoop_db --username root --password '!QAZ2wsx3edc' --table t_user --columns id,name,age -m 1 --target-dir /sqoop_t_user
19/01/04 23:54:30 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
19/01/04 23:54:30 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/01/04 23:54:30 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
19/01/04 23:54:30 INFO tool.CodeGenTool: Beginning code generation
19/01/04 23:54:31 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t_user` AS t LIMIT 1
19/01/04 23:54:31 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t_user` AS t LIMIT 1
19/01/04 23:54:31 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoop-2.5
Note: /tmp/sqoop-root/compile/84e97965496cc61c73c17151375a419b/t_user.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/01/04 23:54:33 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/84e97965496cc61c73c17151375a419b/t_user.jar
19/01/04 23:54:33 WARN manager.MySQLManager: It looks like you are importing from mysql.
19/01/04 23:54:33 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
19/01/04 23:54:33 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
19/01/04 23:54:33 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
19/01/04 23:54:33 INFO mapreduce.ImportJobBase: Beginning import of t_user
19/01/04 23:54:33 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/01/04 23:54:34 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/01/04 23:54:34 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.79.138:8032
19/01/04 23:54:48 INFO db.DBInputFormat: Using read commited transaction isolation
19/01/04 23:54:48 INFO mapreduce.JobSubmitter: number of splits:1
19/01/04 23:54:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1546674829746_0001
19/01/04 23:54:49 INFO impl.YarnClientImpl: Submitted application application_1546674829746_0001
19/01/04 23:54:49 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1546674829746_0001/
19/01/04 23:54:49 INFO mapreduce.Job: Running job: job_1546674829746_0001
19/01/04 23:54:59 INFO mapreduce.Job: Job job_1546674829746_0001 running in uber mode : false
19/01/04 23:54:59 INFO mapreduce.Job: map 0% reduce 0%
19/01/04 23:55:06 INFO mapreduce.Job: map 100% reduce 0%
19/01/04 23:55:06 INFO mapreduce.Job: Job job_1546674829746_0001 completed successfully
19/01/04 23:55:06 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=116299
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=63
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=4153
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=4153
Total vcore-seconds taken by all map tasks=4153
Total megabyte-seconds taken by all map tasks=4252672
Map-Reduce Framework
Map input records=6
Map output records=6
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=69
CPU time spent (ms)=1170
Physical memory (bytes) snapshot=175808512
Virtual memory (bytes) snapshot=893071360
Total committed heap usage (bytes)=84934656
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=63
19/01/04 23:55:06 INFO mapreduce.ImportJobBase: Transferred 63 bytes in 32.3608 seconds (1.9468 bytes/sec)
19/01/04 23:55:06 INFO mapreduce.ImportJobBase: Retrieved 6 records.
运行效果:
官网提供另一种方式,即读取文件的方式来实现上面的导入功能
cd
mkdir mysqoopdir
cd mysqoopdir vi mysql_to_hdfs import
--connect
jdbc:mysql://node1:3306/sqoop_db
--username
root
--password
'!QAZ2wsx3edc'
--table
t_user
--columns
id,name,age
-m
1
--target-dir
/sqoop_t_user
--delete-target-dir :wq sqoop --options-file mysql_to_hdfs
我们可以通过Hive,来验证导入结果
[root@node1 bin]# ./hive
19/01/05 00:03:29 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect. Use hive.hmshandler.retry.* instead Logging initialized using configuration in jar:file:/root/apache-hive-0.13.1-bin/lib/hive-common-0.13.1.jar!/hive-log4j.properties
hive> dfs -cat /sqoop_t_user/*;
1,Tom,20
2,John,18
3,Div,25
4,Susan,31
5,Tiran,40
6,Shasita,13
我们看到的结果和mysql里面的数据一样。
应用场景:如果现在我们的需要处理/分析的数据都存在Mysql数据库里面,并且数据量比较大,我们想要通过离线分析这些数据。这时,我们就可以把Mysql里面的数据通过Sqoop导入到Hdfs里面,进行分析处理。
导入查询结果:
--导入查询结果
cd mysqoopdir vi mysql_query_to_hdfs import
--connect
jdbc:mysql://node1:3306/sqoop_db
--username
root
--password
'!QAZ2wsx3edc'
-e
select id, name from t_user where id >= 1 and $CONDITIONS
-m
1
--target-dir
/sqoop_t_user
--delete-target-dir :wq sqoop --options-file mysql_query_to_hdfs --检验
hive> dfs -cat /sqoop_t_user/*;
1,Tom
2,John
3,Div
4,Susan
5,Tiran
6,Shasita
9.2Mysql导入数据到Hive
--mysql导入到Hive
cd mysqoopdir vi mysql_to_hive import
--connect
jdbc:mysql://node1:3306/sqoop_db
--username
root
--password
'!QAZ2wsx3edc'
--table
t_user
-m
1
--create-hive-table
--target-dir
/sqoop_mysql_to_hive/
--hive-home
/home/hive/
--hive-import
--hive-table
t_sqoop_mysql_t_user_to_hive
--create-hive-table :wq sqoop --options-file mysql_to_hive --检验
hive> select * from t_sqoop_mysql_t_user_to_hive;
OK
1 Tom 20
2 John 18
3 Div 25
4 Susan 31
5 Tiran 40
6 Shasita 13
Time taken: 0.577 seconds, Fetched: 6 row(s)
9.3.Mysql导入到Hbase
--mysql导入到Hbase
cd mysqoopdir vi mysql_to_hbase import
--connect
jdbc:mysql://node1:3306/sqoop_db
--username
root
--password
'!QAZ2wsx3edc'
--table
t_user
-m
1
--hbase-table
t_sqoop_mysql_t_user_to_hbase
--hbase-row-key
id
--hbase-create-table
--column-family
cf :wq sqoop --options-file mysql_to_hbase --检验
hbase(main):004:0> scan 't_sqoop_mysql_t_user_to_hbase'
ROW COLUMN+CELL
1 column=cf:age, timestamp=1546680684317, value=20
1 column=cf:name, timestamp=1546680684317, value=Tom
2 column=cf:age, timestamp=1546680684317, value=18
2 column=cf:name, timestamp=1546680684317, value=John
3 column=cf:age, timestamp=1546680684317, value=25
3 column=cf:name, timestamp=1546680684317, value=Div
4 column=cf:age, timestamp=1546680684317, value=31
4 column=cf:name, timestamp=1546680684317, value=Susan
5 column=cf:age, timestamp=1546680684317, value=40
5 column=cf:name, timestamp=1546680684317, value=Tiran
6 column=cf:age, timestamp=1546680684317, value=13
6 column=cf:name, timestamp=1546680684317, value=Shasita
6 row(s) in 0.1730 seconds
9.4.Hdfs导出到Mysql
--Hdfs导出到Mysql
cd mysqoopdir vi hdfs_to_mysql export
--connect
jdbc:mysql://node1:3306/sqoop_db
--username
root
--password
'!QAZ2wsx3edc'
--table
t_hdfs_user_to_mysql
-m
1
--columns
id,name,age
--export-dir
/sqoop_t_user/ :wq --进入node1
mysql -u root -p
!QAZ2wsx3edc
show database;
use sqoop_db;
show tables;
--创建t_hdfs_user_to_mysql表
CREATE TABLE t_hdfs_user_to_mysql (id INT, name VARCHAR(20), age INT); cd mysqoopdir
sqoop --options-file hdfs_to_mysql --检验
--执行导入前
mysql> select * from t_hdfs_user_to_mysql;
Empty set (0.00 sec) --执行导入后
mysql> mysql> select * from t_hdfs_user_to_mysql;
+------+---------+------+
| id | name | age |
+------+---------+------+
| 1 | Tom | 20 |
| 2 | John | 18 |
| 3 | Div | 25 |
| 4 | Susan | 31 |
| 5 | Tiran | 40 |
| 6 | Shasita | 13 |
+------+---------+------+
6 rows in set (0.00 sec)
========================================================
More reading,and english is important.
I'm Hongten
大哥哥大姐姐,觉得有用打赏点哦!你的支持是我最大的动力。谢谢。
Hongten博客排名在100名以内。粉丝过千。
Hongten出品,必是精品。
E | hongtenzone@foxmail.com B | http://www.cnblogs.com/hongten
========================================================
关系数据库数据与hadoop数据进行转换的工具 - Sqoop的更多相关文章
- 大数据除了Hadoop还有哪些常用的工具?
大数据除了Hadoop还有哪些常用的工具? 1.Hadoop大数据生态平台Hadoop 是一个能够对大量数据进行分布式处理的软件框架.但是 Hadoop 是以一种可靠.高效.可伸缩的方式进行处理的.H ...
- 大数据之路week07--day06 (Sqoop 将关系数据库(oracle、mysql、postgresql等)数据与hadoop数据进行转换的工具)
为了方便后面的学习,在学习Hive的过程中先学习一个工具,那就是Sqoop,你会往后机会发现sqoop是我们在学习大数据框架的最简单的框架了. Sqoop是一个用来将Hadoop和关系型数据库中的数据 ...
- 数据集成工具—Sqoop
数据集成/采集/同步工具 @ 目录 数据集成/采集/同步工具 Sqoop简介 Sqoop安装 1.上传并解压 2.修改文件夹名字 3.修改配置文件 4.修改环境变量 5.添加MySQL连接驱动 6.测 ...
- sqoop数据迁移(基于Hadoop和关系数据库服务器之间传送数据)
1:sqoop的概述: (1):sqoop是apache旗下一款“Hadoop和关系数据库服务器之间传送数据”的工具.(2):导入数据:MySQL,Oracle导入数据到Hadoop的HDFS.HIV ...
- 大数据技术Hadoop入门理论系列之一----hadoop生态圈介绍
Technorati 标记: hadoop,生态圈,ecosystem,yarn,spark,入门 1. hadoop 生态概况 Hadoop是一个由Apache基金会所开发的分布式系统基础架构. 用 ...
- 大数据和Hadoop生态圈
大数据和Hadoop生态圈 一.前言: 非常感谢Hadoop专业解决方案群:313702010,兄弟们的大力支持,在此说一声辛苦了,经过两周的努力,已经有啦初步的成果,目前第1章 大数据和Hadoop ...
- Hadoop专业解决方案-第1章 大数据和Hadoop生态圈
一.前言: 非常感谢Hadoop专业解决方案群:313702010,兄弟们的大力支持,在此说一声辛苦了,经过两周的努力,已经有啦初步的成果,目前第1章 大数据和Hadoop生态圈小组已经翻译完成,在此 ...
- hadoop数据容易出现错误的地方
最近在搞关于数据分析的项目,做了一点总结. 下图是系统的数据流向.容易出现错误的地方.1.数据进入hadoop仓库有四种来源,这四种是最基本的数据,简称ods,original data source ...
- 【转】大数据以及Hadoop相关概念介绍
原博文出自于: http://www.cnblogs.com/xdp-gacl/p/4230220.html 感谢! 一.大数据的基本概念 1.1.什么是大数据 大数据指的就是要处理的数据是TB级别以 ...
随机推荐
- Tomcat 改端口
Tomcat 改端口 进入 tomcat 解压包下面的 conf 目录 打开文件 server.xml 找到以下 三处位置,并改为对应端口 1)找到 8005 <Server port=&quo ...
- Java读取一个文本文件拼接成一个字符串(readFileToString)
import java.io.BufferedReader; import java.io.File; import java.io.FileInputStream; import java.io.I ...
- 分布式系列十: Redis安装和命令
redis是一个开源的, 内存数据结构存储, 一般用来作为数据库,缓存和消息代理. Redis的优势 多种数据结构 字符类型String 散列类型Hash 列表类型List 集合类型Set 有序集合类 ...
- return *this和return this有什么区别?
return *this返回的是当前对象的克隆或者本身(若返回类型为A, 则是克隆, 若返回类型为A&, 则是本身 ). return this返回当前对象的地址(指向当前对象的指针). 转: ...
- static 关键字 静态成员变量及静态成员函数
static类成员 类成员类似于C语言的全局变量,但是与全局变量又有所不同,例如,全局变量是可以被任何的用户代码所修改,而且全局变量破坏了对象的封装性. 使用类的 static 成员的优点 使用 st ...
- Filebeat 启动关闭流程
启动阶段: instance/beat.go #打印home路径.配置路径.数据路径和日志路径 seccomp #Syscall filter检查 instance/beat.go #beat inf ...
- Python logging日志系统
写我小小的日志系统 配置logging有以下几种方式: 1)使用Python代码显式的创建loggers, handlers和formatters并分别调用它们的配置函数: 2)创建一个日志配置文件, ...
- 【转载的】这张图能容易理解sql joins,收藏下!
- mysql 从一个表查询数据插入另一个表或当前表
mysql insert into 表明(uid,lng,lat) SELECT uuid,lng,lat FROM 表明
- DDT驱动selenium自动化测试
建两个.py文件分别是是读取xlsx文件内容,一个是测试用例使用ddt驱动 获取xlsx文件内容 import xlrd class ParseExcel(object): def __init__( ...