Bulk Importing Data into HBase
HBase is commonly used for batch analysis of big data, so in many situations large amounts of data must be imported into HBase from external sources.
HBase provides a tool designed for exactly this kind of bulk import: importtsv. Its usage is as follows:
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>

Imports the given input directory of TSV data into the specified table.

The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family, or a columnfamily:qualifier. The special
column name HBASE_ROW_KEY is used to designate that this column should be used
as the row key for each imported record. You must specify exactly one column
to be the row key, and you must specify a column name for every column that exists in the
input data. Another special column HBASE_TS_KEY designates that this column should be
used as the timestamp for each record. Unlike HBASE_ROW_KEY, HBASE_TS_KEY is optional.
You must specify at most one column as the timestamp key for each imported record.
Records with invalid timestamps (blank, non-numeric) will be treated as bad records.
Note: if you use this option, then the 'importtsv.timestamp' option will be ignored.

By default importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
  -Dimporttsv.bulk.output=/path/for/output
Note: if you do not use this option, then the target table must already exist in HBase.

Other options that may be specified with -D include:
  -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
  '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
  -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
  -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
For performance consider the following options:
  -Dmapred.map.tasks.speculative.execution=false
  -Dmapred.reduce.tasks.speculative.execution=false

HBase's importtsv tool supports importing data from TSV files into HBase. Loading text data into HBase with this tool is very efficient, because the import is carried out by a MapReduce job.
Even when the data has to come from an existing relational database, you can first export it to a text file and then import it into HBase with importtsv. This approach works well for massive imports, because exporting the data is much faster than executing SQL against the relational database.
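As a rough sketch of that export step (the database, table, and column names here are hypothetical, not part of this walkthrough), a MySQL source table could be dumped to tab-separated text with the mysql client, whose batch mode (-B) emits tab-separated rows and -N suppresses the header line:

mysql -N -B -u dbuser -p -e "SELECT ip, countrycode, countryname, region, regionname, city, latitude, longitude, timezone FROM ip_info" geoip_db > ipinfo.tsv

The resulting ipinfo.tsv can then be copied into HDFS and fed to importtsv.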
The importtsv tool can not only load data directly into an HBase table, it can also generate files in HBase's native format (HFile), which you can then load straight into a running HBase cluster with HBase's bulk load tool. This reduces the network traffic caused by data transfer and HBase writes during a migration. The sections below describe how importtsv and the bulk load tool are used. We first show how to load data from a TSV file into an HBase table with importtsv, then cover how to generate HBase native-format files directly and how to load those generated files into HBase.
Bulk load uses MapReduce to move files from HDFS into HBase, which makes it very useful for loading massive datasets.
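Before running the test, the input TSV file must be uploaded to HDFS, and when loading directly into a table (i.e. without -Dimporttsv.bulk.output) the target table must already exist. A minimal preparation sketch, assuming the table and column family used throughout this example (the local path is a placeholder; /input matches the input directory used below):

hbase(main):001:0> create 'HiddenIPInfo', 'IPAddress'

bin/hadoop fs -copyFromLocal /path/to/ipinfo.tsv /input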
The test run is as follows:
landen@Master:~/UntarFile/hadoop-1.0.4$ bin/hadoop jar $HADOOP_HOME/lib/hbase-0.94.12.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,IPAddress:countrycode,IPAddress:countryname,IPAddress:region,IPAddress:regionname,IPAddress:city,IPAddress:latitude,IPAddress:longitude,IPAddress:timezone -Dimporttsv.bulk.output=/output HiddenIPInfo /input
Warning: $HADOOP_HOME is deprecated.
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:host.name=Master
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:java.version=1.7.0_17
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:java.home=/home/landen/UntarFile/jdk1.7.0_17/jre
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/home/landen/UntarFile/hadoop-1.0.4/conf:/home/landen/UntarFile/jdk1.7.0_17/lib/tools.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/..:/home/landen/UntarFile/hadoop-1.0.4/libexec/../hadoop-core-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/asm-3.2.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/aspectjrt-1.6.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/aspectjtools-1.6.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/chukwa-0.5.0-client.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/chukwa-0.5.0.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-beanutils-1.7.0.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-beanutils-core-1.8.0.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-cli-1.2.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-codec-1.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-collections-3.2.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-configuration-1.6.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-daemon-1.0.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-digester-1.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-el-1.0.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-httpclient-3.0.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-io-2.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-lang-2.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-logging-1.1.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-logging-api-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-math-2.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-net-1.4.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/core-3.1.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/guava-11.0.2.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hadoop-capacity-scheduler-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hadoop-fairscheduler-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hadoop-thriftfs-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hbase-0.94.12.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hsqldb-1.8.0.10.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jackson-core-asl-1.8.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jasper-compiler-5.5.12.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jasper-runtime-5.5.12.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jdeb-0.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jersey-core-1.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jersey-json-1.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jersey-server-1.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jets3t-0.6.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jetty-6.1.26.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jetty-util-6.1.26.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jsch-0.1.42.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/json-simple-1.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/junit-4.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/kfs-0.2.2.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/LoadJsonData.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/..
/lib/log4j-1.2.15.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/mockito-all-1.8.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/oro-2.0.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/protobuf-java-2.4.0a.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/servlet-api-2.5-20081211.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/slf4j-api-1.4.3.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/slf4j-log4j12-1.4.3.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/xmlenc-0.52.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/zookeeper-3.4.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jsp-2.1/jsp-2.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jsp-2.1/jsp-api-2.1.jar
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/native/Linux-i386-32
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:os.arch=i386
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:os.version=3.2.0-24-generic-pae
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:user.name=landen
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/landen
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Client environment:user.dir=/home/landen/UntarFile/hadoop-1.0.4
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=Slave1:2222,Master:2222,Slave2:2222 sessionTimeout=180000 watcher=hconnection
13/12/09 21:52:28 INFO zookeeper.ClientCnxn: Opening socket connection to server Slave1/10.21.244.124:2222. Will not attempt to authenticate using SASL (unknown error)
13/12/09 21:52:28 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 6809@Master
13/12/09 21:52:28 INFO zookeeper.ClientCnxn: Socket connection established to Slave1/10.21.244.124:2222, initiating session
13/12/09 21:52:28 INFO zookeeper.ClientCnxn: Session establishment complete on server Slave1/10.21.244.124:2222, sessionid = 0x142cbdf535f0010, negotiated timeout = 180000
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=Slave1:2222,Master:2222,Slave2:2222 sessionTimeout=180000 watcher=catalogtracker-on-org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@821075
13/12/09 21:52:28 INFO zookeeper.ClientCnxn: Opening socket connection to server Slave2/10.21.244.110:2222. Will not attempt to authenticate using SASL (unknown error)
13/12/09 21:52:28 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 6809@Master
13/12/09 21:52:28 INFO zookeeper.ClientCnxn: Socket connection established to Slave2/10.21.244.110:2222, initiating session
13/12/09 21:52:28 INFO zookeeper.ClientCnxn: Session establishment complete on server Slave2/10.21.244.110:2222, sessionid = 0x242d5abedac0016, negotiated timeout = 180000
13/12/09 21:52:28 INFO zookeeper.ClientCnxn: EventThread shut down
13/12/09 21:52:28 INFO zookeeper.ZooKeeper: Session: 0x242d5abedac0016 closed
13/12/09 21:52:28 INFO mapreduce.HFileOutputFormat: Looking up current regions for table org.apache.hadoop.hbase.client.HTable@1ae6df8
13/12/09 21:52:28 INFO mapreduce.HFileOutputFormat: Configuring 1 reduce partitions to match current region count
13/12/09 21:52:28 INFO mapreduce.HFileOutputFormat: Writing partition information to hdfs://Master:9000/user/landen/partitions_b0c3723c-85ea-4828-8521-52de201023f0
13/12/09 21:52:28 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/12/09 21:52:28 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/12/09 21:52:28 INFO compress.CodecPool: Got brand-new compressor
13/12/09 21:52:29 INFO mapreduce.HFileOutputFormat: Incremental table output configured.
13/12/09 21:52:34 INFO input.FileInputFormat: Total input paths to process : 1
13/12/09 21:52:34 WARN snappy.LoadSnappy: Snappy native library not loaded
13/12/09 21:52:35 INFO mapred.JobClient: Running job: job_201312042044_0027
13/12/09 21:52:36 INFO mapred.JobClient: map 0% reduce 0%
13/12/09 21:53:41 INFO mapred.JobClient: map 100% reduce 0%
13/12/09 21:53:56 INFO mapred.JobClient: map 100% reduce 100%
13/12/09 21:54:01 INFO mapred.JobClient: Job complete: job_201312042044_0027
13/12/09 21:54:01 INFO mapred.JobClient: Counters: 30
13/12/09 21:54:01 INFO mapred.JobClient: Job Counters
13/12/09 21:54:01 INFO mapred.JobClient: Launched reduce tasks=1
13/12/09 21:54:01 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=42735
13/12/09 21:54:01 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/12/09 21:54:01 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/12/09 21:54:01 INFO mapred.JobClient: Launched map tasks=1
13/12/09 21:54:01 INFO mapred.JobClient: Data-local map tasks=1
13/12/09 21:54:01 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=13878
13/12/09 21:54:01 INFO mapred.JobClient: ImportTsv
13/12/09 21:54:01 INFO mapred.JobClient: Bad Lines=0
13/12/09 21:54:01 INFO mapred.JobClient: File Output Format Counters
13/12/09 21:54:01 INFO mapred.JobClient: Bytes Written=2194
13/12/09 21:54:01 INFO mapred.JobClient: FileSystemCounters
13/12/09 21:54:01 INFO mapred.JobClient: FILE_BYTES_READ=1895
13/12/09 21:54:01 INFO mapred.JobClient: HDFS_BYTES_READ=333
13/12/09 21:54:01 INFO mapred.JobClient: FILE_BYTES_WRITTEN=77323
13/12/09 21:54:01 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2194
13/12/09 21:54:01 INFO mapred.JobClient: File Input Format Counters
13/12/09 21:54:01 INFO mapred.JobClient: Bytes Read=233
13/12/09 21:54:01 INFO mapred.JobClient: Map-Reduce Framework
13/12/09 21:54:01 INFO mapred.JobClient: Map output materialized bytes=1742
13/12/09 21:54:01 INFO mapred.JobClient: Map input records=3
13/12/09 21:54:01 INFO mapred.JobClient: Reduce shuffle bytes=1742
13/12/09 21:54:01 INFO mapred.JobClient: Spilled Records=6
13/12/09 21:54:01 INFO mapred.JobClient: Map output bytes=1724
13/12/09 21:54:01 INFO mapred.JobClient: Total committed heap usage (bytes)=131731456
13/12/09 21:54:01 INFO mapred.JobClient: CPU time spent (ms)=14590
13/12/09 21:54:01 INFO mapred.JobClient: Combine input records=0
13/12/09 21:54:01 INFO mapred.JobClient: SPLIT_RAW_BYTES=100
13/12/09 21:54:01 INFO mapred.JobClient: Reduce input records=3
13/12/09 21:54:01 INFO mapred.JobClient: Reduce input groups=3
13/12/09 21:54:01 INFO mapred.JobClient: Combine output records=0
13/12/09 21:54:01 INFO mapred.JobClient: Physical memory (bytes) snapshot=184393728
13/12/09 21:54:01 INFO mapred.JobClient: Reduce output records=24
13/12/09 21:54:01 INFO mapred.JobClient: Virtual memory (bytes) snapshot=698474496
13/12/09 21:54:01 INFO mapred.JobClient: Map output records=3
landen@Master:~/UntarFile/hadoop-1.0.4$ bin/hadoop fs -ls /output
Warning: $HADOOP_HOME is deprecated.
Found 3 items
drwxr-xr-x - landen supergroup 0 2013-12-09 21:53 /output/IPAddress
-rw-r--r-- 1 landen supergroup 0 2013-12-09 21:53 /output/_SUCCESS
drwxr-xr-x - landen supergroup 0 2013-12-09 21:52 /output/_logs
The completebulkload tool reads the generated files, determines which region each one belongs to, and then contacts the appropriate region server. The region server moves each HFile into its own storage directory and brings the data online for clients.
landen@Master:~/UntarFile/hadoop-1.0.4$ bin/hadoop jar $HADOOP_HOME/lib/hbase-0.94.12.jar completebulkload /output HiddenIPInfo
(the last argument, HiddenIPInfo, is the target HBase table name)
Warning: $HADOOP_HOME is deprecated.
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:host.name=Master
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:java.version=1.7.0_17
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:java.home=/home/landen/UntarFile/jdk1.7.0_17/jre
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/home/landen/UntarFile/hadoop-1.0.4/conf:/home/landen/UntarFile/jdk1.7.0_17/lib/tools.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/..:/home/landen/UntarFile/hadoop-1.0.4/libexec/../hadoop-core-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/asm-3.2.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/aspectjrt-1.6.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/aspectjtools-1.6.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/chukwa-0.5.0-client.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/chukwa-0.5.0.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-beanutils-1.7.0.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-beanutils-core-1.8.0.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-cli-1.2.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-codec-1.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-collections-3.2.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-configuration-1.6.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-daemon-1.0.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-digester-1.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-el-1.0.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-httpclient-3.0.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-io-2.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-lang-2.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-logging-1.1.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-logging-api-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-math-2.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/commons-net-1.4.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/core-3.1.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/guava-11.0.2.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hadoop-capacity-scheduler-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hadoop-fairscheduler-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hadoop-thriftfs-1.0.4.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hbase-0.94.12.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/hsqldb-1.8.0.10.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jackson-core-asl-1.8.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jasper-compiler-5.5.12.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jasper-runtime-5.5.12.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jdeb-0.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jersey-core-1.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jersey-json-1.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jersey-server-1.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jets3t-0.6.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jetty-6.1.26.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jetty-util-6.1.26.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jsch-0.1.42.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/json-simple-1.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/junit-4.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/kfs-0.2.2.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/LoadJsonData.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/..
/lib/log4j-1.2.15.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/mockito-all-1.8.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/oro-2.0.8.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/protobuf-java-2.4.0a.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/servlet-api-2.5-20081211.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/slf4j-api-1.4.3.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/slf4j-log4j12-1.4.3.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/xmlenc-0.52.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/zookeeper-3.4.5.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jsp-2.1/jsp-2.1.jar:/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/jsp-2.1/jsp-api-2.1.jar
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/home/landen/UntarFile/hadoop-1.0.4/libexec/../lib/native/Linux-i386-32
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:os.arch=i386
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:os.version=3.2.0-24-generic-pae
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:user.name=landen
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/landen
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Client environment:user.dir=/home/landen/UntarFile/hadoop-1.0.4
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=Slave1:2222,Master:2222,Slave2:2222 sessionTimeout=180000 watcher=hconnection
13/12/09 22:00:00 INFO zookeeper.ClientCnxn: Opening socket connection to server Slave1/10.21.244.124:2222. Will not attempt to authenticate using SASL (unknown error)
13/12/09 22:00:00 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 7168@Master
13/12/09 22:00:00 INFO zookeeper.ClientCnxn: Socket connection established to Slave1/10.21.244.124:2222, initiating session
13/12/09 22:00:00 INFO zookeeper.ClientCnxn: Session establishment complete on server Slave1/10.21.244.124:2222, sessionid = 0x142cbdf535f0011, negotiated timeout = 180000
13/12/09 22:00:00 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=Slave1:2222,Master:2222,Slave2:2222 sessionTimeout=180000 watcher=catalogtracker-on-org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@a13b90
13/12/09 22:00:00 INFO zookeeper.ClientCnxn: Opening socket connection to server Slave1/10.21.244.124:2222. Will not attempt to authenticate using SASL (unknown error)
13/12/09 22:00:00 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 7168@Master
13/12/09 22:00:00 INFO zookeeper.ClientCnxn: Socket connection established to Slave1/10.21.244.124:2222, initiating session
13/12/09 22:00:00 INFO zookeeper.ClientCnxn: Session establishment complete on server Slave1/10.21.244.124:2222, sessionid = 0x142cbdf535f0012, negotiated timeout = 180000
13/12/09 22:00:01 INFO zookeeper.ZooKeeper: Session: 0x142cbdf535f0012 closed
13/12/09 22:00:01 INFO zookeeper.ClientCnxn: EventThread shut down
13/12/09 22:00:01 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://Master:9000/output/_SUCCESS
13/12/09 22:00:01 INFO hfile.CacheConfig: Allocating LruBlockCache with maximum size 222.2m
13/12/09 22:00:01 INFO util.ChecksumType: Checksum can use java.util.zip.CRC32
13/12/09 22:00:01 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://Master:9000/output/IPAddress/b29b74ad57ff4be1a62968229b7e23d4 first=125.111.251.118 last=60.180.248.201
landen@Master:~/UntarFile/hadoop-1.0.4$
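Note: completebulkload is implemented by the LoadIncrementalHFiles class, so on a host where the hbase launcher script is set up, the same step should be equivalent to invoking that class directly (a sketch, not run in this walkthrough):

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /output HiddenIPInfo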
Query the data bulk-loaded into the HBase table HiddenIPInfo from the HBase shell:
hbase(main):045:0> scan 'HiddenIPInfo'
ROW COLUMN+CELL
125.111.251.118 column=IPAddress:city, timestamp=1386597147615, value=Ningbo
125.111.251.118 column=IPAddress:countrycode, timestamp=1386597147615, value=CN
125.111.251.118 column=IPAddress:countryname, timestamp=1386597147615, value=China
125.111.251.118 column=IPAddress:latitude, timestamp=1386597147615, value=29.878204
125.111.251.118 column=IPAddress:longitude, timestamp=1386597147615, value=121.5495
125.111.251.118 column=IPAddress:region, timestamp=1386597147615, value=02
125.111.251.118 column=IPAddress:regionname, timestamp=1386597147615, value=Zhejiang
125.111.251.118 column=IPAddress:timezone, timestamp=1386597147615, value=Asia/Shanghai
221.12.10.218 column=IPAddress:city, timestamp=1386597147615, value=Hangzhou
221.12.10.218 column=IPAddress:countrycode, timestamp=1386597147615, value=CN
221.12.10.218 column=IPAddress:countryname, timestamp=1386597147615, value=China
221.12.10.218 column=IPAddress:latitude, timestamp=1386597147615, value=30.293594
221.12.10.218 column=IPAddress:longitude, timestamp=1386597147615, value=120.16141
221.12.10.218 column=IPAddress:region, timestamp=1386597147615, value=02
221.12.10.218 column=IPAddress:regionname, timestamp=1386597147615, value=Zhejiang
221.12.10.218 column=IPAddress:timezone, timestamp=1386597147615, value=Asia/Shanghai
60.180.248.201 column=IPAddress:city, timestamp=1386597147615, value=Wenzhou
60.180.248.201 column=IPAddress:countrycode, timestamp=1386597147615, value=CN
60.180.248.201 column=IPAddress:countryname, timestamp=1386597147615, value=China
60.180.248.201 column=IPAddress:latitude, timestamp=1386597147615, value=27.999405
60.180.248.201 column=IPAddress:longitude, timestamp=1386597147615, value=120.66681
60.180.248.201 column=IPAddress:region, timestamp=1386597147615, value=02
60.180.248.201 column=IPAddress:regionname, timestamp=1386597147615, value=Zhejiang
60.180.248.201 column=IPAddress:timezone, timestamp=1386597147615, value=Asia/Shanghai
3 row(s) in 0.2640 seconds
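To spot-check a single row instead of scanning the whole table, a get against one of the imported row keys works as well (a small sketch using a row key from the scan above):

hbase(main):046:0> get 'HiddenIPInfo', '125.111.251.118'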
Notes:
1> HBASE_ROW_KEY does not have to be the first column; if it is listed second in -Dimporttsv.columns, then the second column of the TSV file is used as the row key.
2> The mapping between TSV field positions and HBase table columns is configured through the -Dimporttsv.columns parameter.
3> If the output directory -Dimporttsv.bulk.output is specified, the HiddenIPInfo table is not created right away; the HFiles are only written to the output directory (the table appears once the completebulkload import step runs). Then run bin/hadoop jar hbase-VERSION.jar completebulkload /output HiddenIPInfo (where /output is the directory holding the HFiles and HiddenIPInfo is the target HBase table) to move the HFiles from the output directory into the corresponding regions; because this step is essentially just a move, it is very fast.
4> If the data is very large and the table already has regions, the tool performs the necessary splitting, finds the region each key range belongs to, and loads it there.
5> bin/hadoop jar $HADOOP_HOME/lib/hbase-0.94.12.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,IPAddress:countrycode,IPAddress:countryname,IPAddress:region,IPAddress:regionname,IPAddress:city,IPAddress:latitude,IPAddress:longitude,IPAddress:timezone (-Dimporttsv.bulk.output=/output) HiddenIPInfo /input — when -Dimporttsv.bulk.output is not specified:
1. the table must already be created before the command runs;
2. this mode writes into HBase with Put calls, which performs poorly; the map phase uses TableOutputFormat. When -Dimporttsv.bulk.output is specified, importtsv instead uses HFileOutputFormat to generate store files in HBase's native format (HFiles) in HDFS, and the CompleteBulkLoad tool then loads the generated files into the running cluster. This performs much better, and if the table does not exist, the CompleteBulkLoad tool creates it automatically.
6> The importtsv tool reads data only from HDFS, so the TSV file must first be copied from the local filesystem into HDFS. The source file must satisfy the TSV format (see http://en.wikipedia.org/wiki/Tab-separated_values for details). After obtaining the source file, copy it into HDFS: hadoop dfs -copyFromLocal file:///path/to/source-file hdfs:///path/to/source-file. By default fields are separated by "\t"; to use a different separator, pass e.g. -Dimporttsv.separator="," when running the job. A sample input line is shown after these notes.
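For reference, given the -Dimporttsv.columns order used in this walkthrough (HBASE_ROW_KEY, then countrycode, countryname, region, regionname, city, latitude, longitude, timezone), one input line of the TSV file would look like the following. This sample is reconstructed from the scan output above, with a single tab character between fields; the actual source file was not shown, so treat the exact layout as illustrative:

125.111.251.118	CN	China	02	Zhejiang	Ningbo	29.878204	121.5495	Asia/Shanghai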