使用sqoop把mysql数据导入hive

export HADOOP_COMMON_HOME=/hadoop
export HADOOP_MAPRED_HOME=/hadoop
 
cp /hive/lib/mysql-connector-java-5.1.25-bin.jar /sqoop/lib/

share表第一列为自增主键 share_id,share_id<1000的数据共有999条:

mysql> SELECT COUNT(*) FROM share WHERE share_id<1000;
+----------+
| COUNT(*) |
+----------+
| 999 |
+----------+
1 ROW IN SET (0.00 sec)

使用sqoop导入到hive

[hduser@www lib]$ /sqoop/bin/sqoop import --connect jdbc:mysql://localhost/shipincon --table share --username root --password xxx --hive-import --where "share_id<1000"  -- --default-character-set=utf8
 
 
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
--password 密码 比较危险,可以改成 -P (Read password from console)
13/07/12 18:42:58 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
默认是用hive的分隔符 ctrl+a 可以使用 --fields-terminated-by 来更改
插一句:在console vim中如何输入ctrl+a: 在输入模式下先按ctrl+v,再按ctrl+a即可,会显示成^A
13/07/12 18:42:58 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
13/07/12 18:42:58 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
13/07/12 18:42:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
13/07/12 18:42:58 INFO tool.CodeGenTool: Beginning code generation
13/07/12 18:42:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `share` AS t LIMIT 1
13/07/12 18:42:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `share` AS t LIMIT 1
13/07/12 18:42:58 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /hadoop
Note: /tmp/sqoop-hduser/compile/720a3072de33c8a826cdf96d9eced686/share.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13/07/12 18:43:00 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hduser/compile/720a3072de33c8a826cdf96d9eced686/share.jar
13/07/12 18:43:00 WARN manager.MySQLManager: It looks like you are importing from mysql.
如果没有blob clob longvarbinary字段,可以使用 --direct 速度也许会快一些。 text字段不清楚。
13/07/12 18:43:00 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
13/07/12 18:43:00 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
13/07/12 18:43:00 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
13/07/12 18:43:00 INFO mapreduce.ImportJobBase: Beginning import of share
13/07/12 18:43:02 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`share_id`), MAX(`share_id`) FROM `share` WHERE ( share_id<1000 )
13/07/12 18:43:02 INFO mapred.JobClient: Running job: job_201307121751_0001
13/07/12 18:43:03 INFO mapred.JobClient: map 0% reduce 0%
13/07/12 18:43:12 INFO mapred.JobClient: map 50% reduce 0%
13/07/12 18:43:17 INFO mapred.JobClient: map 100% reduce 0%
13/07/12 18:43:18 INFO mapred.JobClient: Job complete: job_201307121751_0001
13/07/12 18:43:18 INFO mapred.JobClient: Counters: 18
13/07/12 18:43:18 INFO mapred.JobClient: Job Counters
13/07/12 18:43:18 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=20723
13/07/12 18:43:18 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/12 18:43:18 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
################################ map tasks =4 就会生成四个文件
13/07/12 18:43:18 INFO mapred.JobClient: Launched map tasks=4
13/07/12 18:43:18 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
13/07/12 18:43:18 INFO mapred.JobClient: File Output Format Counters
13/07/12 18:43:18 INFO mapred.JobClient: Bytes Written=202094
13/07/12 18:43:18 INFO mapred.JobClient: FileSystemCounters
13/07/12 18:43:18 INFO mapred.JobClient: HDFS_BYTES_READ=455
13/07/12 18:43:18 INFO mapred.JobClient: FILE_BYTES_WRITTEN=260292
13/07/12 18:43:18 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=202094
13/07/12 18:43:18 INFO mapred.JobClient: File Input Format Counters
13/07/12 18:43:18 INFO mapred.JobClient: Bytes Read=0
13/07/12 18:43:18 INFO mapred.JobClient: Map-Reduce Framework
13/07/12 18:43:18 INFO mapred.JobClient: Map input records=999
13/07/12 18:43:18 INFO mapred.JobClient: Physical memory (bytes) snapshot=179216384
13/07/12 18:43:18 INFO mapred.JobClient: Spilled Records=0
13/07/12 18:43:18 INFO mapred.JobClient: CPU time spent (ms)=3360
13/07/12 18:43:18 INFO mapred.JobClient: Total committed heap usage (bytes)=65011712
13/07/12 18:43:18 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1519448064
13/07/12 18:43:18 INFO mapred.JobClient: Map output records=999
13/07/12 18:43:18 INFO mapred.JobClient: SPLIT_RAW_BYTES=455
13/07/12 18:43:18 INFO mapreduce.ImportJobBase: Transferred 197.3574 KB in 18.0516 seconds (10.9329 KB/sec)
共999行记录
13/07/12 18:43:18 INFO mapreduce.ImportJobBase: Retrieved 999 records.
13/07/12 18:43:19 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `share` AS t LIMIT 1
13/07/12 18:43:19 WARN hive.TableDefWriter: Column add_time had to be cast to a less precise type in Hive
add_time的精度变小了。。为嘛? mysql中 `add_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
mysql> select add_time from share limit 1;
+---------------------+
| add_time |
+---------------------+
| 2012-04-08 17:17:13 |
+---------------------+
1 row in set (0.00 sec)
到了hive中会变成:
2012-04-08 17:17:13.0
貌似精度是变大了。
不过据说某些数据库的timestamp字段支持到了毫秒的精度。
13/07/12 18:43:19 INFO hive.HiveImport: Removing temporary files from import process: hdfs://localhost:54310/user/hduser/share/_logs
13/07/12 18:43:19 INFO hive.HiveImport: Loading uploaded data into Hive
13/07/12 18:43:21 INFO hive.HiveImport: WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
13/07/12 18:43:22 INFO hive.HiveImport: Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-0.10.0.jar!/hive-log4j.properties
13/07/12 18:43:22 INFO hive.HiveImport: Hive history file=/tmp/hduser/hive_job_log_hduser_201307121843_1382245494.txt
13/07/12 18:43:25 INFO hive.HiveImport: OK
13/07/12 18:43:25 INFO hive.HiveImport: Time taken: 2.389 seconds
13/07/12 18:43:25 INFO hive.HiveImport: Loading data to table default.share
13/07/12 18:43:26 INFO hive.HiveImport: Table default.share stats: [num_partitions: 0, num_files: 5, num_rows: 0, total_size: 202094, raw_data_size: 0]
13/07/12 18:43:26 INFO hive.HiveImport: OK
13/07/12 18:43:26 INFO hive.HiveImport: Time taken: 0.726 seconds
13/07/12 18:43:26 INFO hive.HiveImport: Hive import complete.
13/07/12 18:43:26 INFO hive.HiveImport: Export directory is empty, removing it.
[hduser@www lib]$
 
 
 
--default-character-set=utf8之前还有 --
 
Arguments to mysqldump and other subprograms may be supplied
after a '--' on the command line.

去hive中看看数据:

[hduser@www lib]$ /hive/bin/hive
WARNING: org.apache.hadoop.metrics.jvm.EventCounter IS deprecated. Please USE org.apache.hadoop.log.metrics.EventCounter IN ALL the log4j.properties files.
Logging initialized USING configuration IN jar:file:/usr/LOCAL/hive/lib/hive-common-0.10.0.jar!/hive-log4j.properties
Hive history file=/tmp/hduser/hive_job_log_hduser_201307121854_1566000179.txt
hive> USE shipincon;
FAILED: Error IN metadata: ERROR: The DATABASE shipincon does NOT exist.
FAILED: Execution Error, RETURN code 1 FROM org.apache.hadoop.hive.ql.EXEC.DDLTask
hive> SHOW DATABASES;
OK
DEFAULT
TIME taken: 0.244 seconds
hive> USE DEFAULT;
OK
TIME taken: 0.018 seconds
hive> SHOW TABLES;
OK
a
cleanmaster
cm
login_info
login_interm
share
TIME taken: 0.155 seconds
hive> SELECT COUNT(*) FROM sahre;
FAILED: SemanticException [Error 10001]: Line 1:21 TABLE NOT found 'sahre'
hive> SELECT COUNT(*) FROM share
> ;
Total MapReduce jobs = 1
Launching Job 1 OUT OF 1
NUMBER OF reduce tasks determined at compile TIME: 1
IN ORDER TO CHANGE the average LOAD FOR a reducer (IN bytes):
SET hive.EXEC.reducers.bytes.per.reducer=<number>
IN ORDER TO LIMIT the maximum NUMBER OF reducers:
SET hive.EXEC.reducers.MAX=<number>
IN ORDER TO SET a constant NUMBER OF reducers:
SET mapred.reduce.tasks=<number>
Starting Job = job_201307121751_0002, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201307121751_0002
KILL Command = /usr/LOCAL/hadoop/libexec/../bin/hadoop job -KILL job_201307121751_0002
Hadoop job information FOR Stage-1: NUMBER OF mappers: 1; NUMBER OF reducers: 1
2013-07-12 18:54:37,120 Stage-1 map = 0%, reduce = 0%
2013-07-12 18:54:40,143 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.82 sec
2013-07-12 18:54:41,154 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.82 sec
2013-07-12 18:54:42,162 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.82 sec
2013-07-12 18:54:43,168 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.82 sec
2013-07-12 18:54:44,175 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.82 sec
2013-07-12 18:54:45,181 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.82 sec
2013-07-12 18:54:46,192 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.82 sec
2013-07-12 18:54:47,199 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.82 sec
2013-07-12 18:54:48,208 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 0.82 sec
2013-07-12 18:54:49,219 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.74 sec
2013-07-12 18:54:50,238 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.74 sec
MapReduce Total cumulative CPU TIME: 1 seconds 740 msec
Ended Job = job_201307121751_0002
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 1.74 sec HDFS READ: 202547 HDFS WRITE: 4 SUCCESS
Total MapReduce CPU TIME Spent: 1 seconds 740 msec
OK
999
TIME taken: 18.573 seconds
hive> SHOW CREATE TABLE share;
OK
CREATE TABLE share(
share_id BIGINT,
add_time string,
tag string)
COMMENT 'Imported by sqoop on 2013/07/12 19:48:33'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://localhost:54310/user/hive/warehouse2/share'
TBLPROPERTIES (
'numPartitions'='0',
'numFiles'='4',
'transient_lastDdlTime'='1373629961',
'totalSize'='227161',
'numRows'='0',
'rawDataSize'='0')
TIME taken: 0.292 seconds
 
hive>

再去dfs里面看看:

[hduser@www lib]$ /hadoop/bin/hadoop dfs -ls /user/hive/warehouse2/share
Found 5 items
-rw-r--r-- 1 hduser supergroup 0 2013-07-12 18:43 /user/hive/warehouse2/share/_SUCCESS
-rw-r--r-- 1 hduser supergroup 53226 2013-07-12 18:43 /user/hive/warehouse2/share/part-m-00000
-rw-r--r-- 1 hduser supergroup 50185 2013-07-12 18:43 /user/hive/warehouse2/share/part-m-00001
-rw-r--r-- 1 hduser supergroup 50634 2013-07-12 18:43 /user/hive/warehouse2/share/part-m-00002
-rw-r--r-- 1 hduser supergroup 48049 2013-07-12 18:43 /user/hive/warehouse2/share/part-m-00003

乍一看,像是4个partition。 可是细看文件名,并不是partition的那种结构:
比如:/user/hive/warehouse2/cm/country_name=VN_en

把文件复制到本地:

/hadoop/bin/hadoop dfs -get /user/hive/warehouse2/share  ~/
[hduser@www share]$ cd ~/share
[hduser@www share]$ wc -l *
250 part-m-00000
250 part-m-00001
249 part-m-00002
250 part-m-00003
0 _SUCCESS
999 total

–hive-overwrite 可以覆盖hive中的现有数据。
–hive-table
Sets the table name to use when importing to Hive.
可以把数据import到某个分区中:
–hive-partition-key Name of a hive field to partition are sharded on
–hive-partition-value String-value that serves as partition key for this imported into hive in this job.

可以写个cron,周期性地把share表的新增数据导入到hive中了: –append

参考:

http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html

http://www.mysqlperformanceblog.com/2013/07/11/mysql-and-hadoop/

使用sqoop把mysql数据导入hive的更多相关文章

  1. python脚本 用sqoop把mysql数据导入hive

    转:https://blog.csdn.net/wulantian/article/details/53064123 用python把mysql数据库的数据导入到hive中,该过程主要是通过pytho ...

  2. 使用 sqoop 将mysql数据导入到hive表(import)

    Sqoop将mysql数据导入到hive表中 先在mysql创建表 CREATE TABLE `sqoop_test` ( `id` ) DEFAULT NULL, `name` varchar() ...

  3. 使用sqoop将mysql数据导入到hive中

    首先准备工具环境:hadoop2.7+mysql5.7+sqoop1.4+hive3.1 准备一张数据库表: 接下来就可以操作了... 一.将MySQL数据导入到hdfs 首先我测试将zhaopin表 ...

  4. Sqoop将mysql数据导入hbase的血与泪

    Sqoop将mysql数据导入hbase的血与泪(整整搞了大半天)  版权声明:本文为yunshuxueyuan原创文章.如需转载请标明出处: https://my.oschina.net/yunsh ...

  5. 使用 sqoop 将mysql数据导入到hdfs(import)

    Sqoop 将mysql 数据导入到hdfs(import) 1.创建mysql表 CREATE TABLE `sqoop_test` ( `id` ) DEFAULT NULL, `name` va ...

  6. sqoop将mysql数据导入hbase、hive的常见异常处理

    原创不易,如需转载,请注明出处https://www.cnblogs.com/baixianlong/p/10700700.html,否则将追究法律责任!!! 一.需求: 1.将以下这张表(test_ ...

  7. 使用sqoop将mysql数据导入到hadoop

    hadoop的安装配置这里就不讲了. Sqoop的安装也很简单. 完成sqoop的安装后,可以这样测试是否可以连接到mysql(注意:mysql的jar包要放到 SQOOP_HOME/lib 下): ...

  8. 使用sqoop将mysql中表导入hive中报错

    [hdfs@node1 root]$ sqoop import --connect jdbc:mysql://node2:3306/cm?charset-utf8 --username root -- ...

  9. sqoop导出mysql数据进入hive错误

    看mr的运行显示:sqoop job可以获得的select max(xxx)结果,但是当mr开始时却显示大片错误,就是连接超时,和连接重置等问题, 最后去每个节点ping mysql的ip地址,发现 ...

随机推荐

  1. 与时间有关的类Date,DateFormat,Calendar

    Date类用于表示日期和时间.它没考虑国际化问题,所以又设计了另外两个类. Calendar类: 主要是进行日期字段之间的相互操作. 编程实例:计算出距当前日期时间315天后的日期时间,并使用”xxx ...

  2. XMLRPC 学习笔记(一)- Python 实现

    参考文章: http://baike.baidu.com/view/643379.htm http://docs.python.org/2/library/xmlrpclib.html http:// ...

  3. MFC多国语言——资源DLL

    此随笔中主要内容来自http://blog.csdn.net/china_hxx/article/details/10066655,原出处不详. 以下内容基于VC 6.0.要实现界面多语言化,必须要先 ...

  4. 华为OJ平台试题 —— 数组:输入n个整数,输出当中最小的k个

    输入n个整数.输出当中最小的k个: 代码: /*  * 输入n个整数,输出当中最小的k个.  * 输入说明:1.输入两个整数:2.输入一个整数数组  * 输出说明:输出一个整数数组  */ <p ...

  5. UE问题分部解决

    0.寻找Actor ALandscape *land=nullptr; for (TActorIterator<ALandscape> It(GEditor->GetEditorWo ...

  6. 手动编译svn

    #!/bin/bash yum -y remove subversionmkdir -p /dist/{dist,src}cd /dist/dist/bin/rm -f openssl* subver ...

  7. laragon 使用(php版本升级切换)

    php版本升级切换 1.下载对应不带 nts 的php版本(http://windows.php.net/download/),然后解压放到\laragon\bin\php目录下 2.记得改环境变量下 ...

  8. poj3414

    Pots Time Limit: 1000MS   Memory Limit: 65536K Total Submissions: 13545   Accepted: 5717   Special J ...

  9. vs2017 git 操作重置、还原、挑拣对比

    工具 :vs2017 git 操作 背景:本地与远程分支同步 操作:还原.挑拣.重置--hard .重置--mixed 分支:本地1.本地2.origin\本地1   基本操作 1:分支:本地2-ad ...

  10. VS2013\VS2017 使用git 总是需要输入账号密码

    问题: VS2013\VS2017 使用git 总是需要输入账号密码 解决方案:删除原凭证,或者修改原凭证,重新输入一次账号和密码并且选择“记住凭证”即可!