于Spark它是一个计算框架,于Spark环境,不仅支持单个文件操作,HDFS档,同时也可以使用Spark对Hbase操作。

从企业的数据源HBase取出。这涉及阅读hbase数据,在本文中尽快为了尽可能地让我们可以实践和操作Hbase。Spark Shell 来进行Hbase操作。

一、环境:

Haoop2.2.0

Hbase版本号0.96.2-hadoop2, r1581096

Spark1.0.0

本文如果环境已经搭建好,Spark环境搭建可见Spark Haoop集群搭建

Hadoop2.2.0要注意和Hbase的版本号兼容,这里Hbase採用0.96.2

二、原理

Spark操作HBase事实上是和java client操作HBase的原理是一致的:

scala和java都是基于jvm的语言。仅仅要将hbase的类载入到classpath内,就可以调用操作,其他框架相似。

同样点:即都是当作client来连接HMaster,然后利用hbase的API来对Hbase进行操作。

不同点:唯一不同的是:Spark能够将Hbase的数据来当作RDD处理,从而利用Spark来进行并行计算。

三、实践

1、首先检查依赖jar包。在这之前如果hbase的jar包不在spark-shell的classpath里。则须要加入进来。
设置方法: 在Spark-evn.sh里加入SPARK_CLASSPATH=/home/victor/software/hbase/lib/*
这样再再启动启动bin/spark-shell, 启动完成而且Worker成功注冊上之后。import jar 包。 

2、操作hbase

2.1 Hbase中数据

hbase里有张score表,里面有2个CF。分别为course和grade。数据例如以下:
hbase(main):001:0> scan 'scores'
ROW                                    COLUMN+CELL                                                                                                     
 Jim                                   column=course:art, timestamp=1404142440676, value=67                                                            
 Jim                                   column=course:math, timestamp=1404142434405, value=77                                                           
 Jim                                   column=grade:, timestamp=1404142422653, value=3                                                                 
 Tom                                   column=course:art, timestamp=1404142407018, value=88                                                            
 Tom                                   column=course:math, timestamp=1404142398986, value=97                                                           
 Tom                                   column=grade:, timestamp=1404142383206, value=5                                                                 
 shengli                               column=course:art, timestamp=1404142468266, value=17                                                            
 shengli                               column=course:math, timestamp=1404142461952, value=27                                                           
 shengli                               column=grade:, timestamp=1404142452157, value=8                                                                 
3 row(s) in 0.3230 seconds

2.1  初始化连接參数

scala> import org.apache.spark._
import org.apache.spark._ scala> import org.apache.spark.rdd.NewHadoopRDD
import org.apache.spark.rdd.NewHadoopRDD scala> import org.apache.hadoop.conf.Configuration;  
import org.apache.hadoop.conf.Configuration scala> import org.apache.hadoop.hbase.HBaseConfiguration;  
import org.apache.hadoop.hbase.HBaseConfiguration scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.mapreduce.TableInputFormat scala> val configuration = HBaseConfiguration.create();  //初始化配置
configuration: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hbase-default.xml, hbase-site.xml scala> configuration.set("hbase.zookeeper.property.clientPort", "2181"); //设置zookeeper client端口 scala> configuration.set("hbase.zookeeper.quorum", "localhost");  //设置zookeeper quorum scala> configuration.set("hbase.master", "localhost:60000");  //设置hbase master scala> configuration.addResource("/home/victor/software/hbase/conf/hbase-site.xml")  //将hbase的配置载入
scala> configuration.set(TableInputFormat.INPUT_TABLE, "scores")

scala> import org.apache.hadoop.hbase.client.HBaseAdminimport org.apache.hadoop.hbase.client.HBaseAdmin

scala> val hadmin = new HBaseAdmin(configuration); //实例化hbase管理
2014-07-01 00:39:24,649 INFO [main] zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) - Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0xc7eea5, quorum=localhost:2181, baseZNode=/hbase
2014-07-01 00:39:24,707 INFO [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:<init>(120)) - Process identifier=hconnection-0xc7eea5 connecting to ZooKeeper ensemble=localhost:2181
2014-07-01 00:39:24,753 INFO [main-SendThread(localhost:2181)] zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(966)) - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2014-07-01 00:39:24,755 INFO [main-SendThread(localhost:2181)] zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(849)) - Socket connection established to localhost/127.0.0.1:2181, initiating session
2014-07-01 00:39:24,938 INFO [main-SendThread(localhost:2181)] zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1207)) - Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x146ed61c4ef0015, negotiated timeout = 40000
hadmin: org.apache.hadoop.hbase.client.HBaseAdmin = org.apache.hadoop.hbase.client.HBaseAdmin@1260466


接下来用haoop api来创建一个RDD

scala> val hrdd = sc.newAPIHadoopRDD(configuration, classOf[TableInputFormat],
| classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
| classOf[org.apache.hadoop.hbase.client.Result])
2014-07-01 00:51:06,683 WARN [main] util.SizeEstimator (Logging.scala:logWarning(70)) - Failed to check whether UseCompressedOops is set; assuming yes
2014-07-01 00:51:06,936 INFO [main] storage.MemoryStore (Logging.scala:logInfo(58)) - ensureFreeSpace(85877) called with curMem=0, maxMem=308910489
2014-07-01 00:51:06,946 INFO [main] storage.MemoryStore (Logging.scala:logInfo(58)) - Block broadcast_0 stored as values to memory (estimated size 83.9 KB, free 294.5 MB)
hrdd: org.apache.spark.rdd.RDD[(org.apache.hadoop.hbase.io.ImmutableBytesWritable, org.apache.hadoop.hbase.client.Result)] = NewHadoopRDD[0] at newAPIHadoopRDD at <console>:22

版本号一:(最新的版本号下列code可能不work,请看版本号二)

读取记录:

这里我们take 1 条数据,能够看到格式是依照我们设定的HadoopRDD。key是一个不变的ImmutableBytesWritable,value是Hbase的Result
scala> hrdd take 1
2014-07-01 00:51:50,371 INFO [main] spark.SparkContext (Logging.scala:logInfo(58)) - Starting job: take at <console>:25
2014-07-01 00:51:50,423 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Got job 0 (take at <console>:25) with 1 output partitions (allowLocal=true)
2014-07-01 00:51:50,425 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Final stage: Stage 0(take at <console>:25)
2014-07-01 00:51:50,426 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Parents of final stage: List()
2014-07-01 00:51:50,477 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Missing parents: List()
2014-07-01 00:51:50,478 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Computing the requested partition locally
2014-07-01 00:51:50,509 INFO [Local computation of job 0] rdd.NewHadoopRDD (Logging.scala:logInfo(58)) - Input split: localhost:,
2014-07-01 00:51:50,894 INFO [main] spark.SparkContext (Logging.scala:logInfo(58)) - Job finished: take at <console>:25, took 0.522612687 s
res5: Array[(org.apache.hadoop.hbase.io.ImmutableBytesWritable, org.apache.hadoop.hbase.client.Result)] = Array((4a 69 6d,keyvalues={Jim/course:art/1404142440676/Put/vlen=2/mvcc=0, Jim/course:math/1404142434405/Put/vlen=2/mvcc=0, Jim/grade:/1404142422653/Put/vlen=1/mvcc=0}))

找到Result对象

scala> val res = hrdd.take(1)
2014-07-01 01:09:13,486 INFO [main] spark.SparkContext (Logging.scala:logInfo(58)) - Starting job: take at <console>:24
2014-07-01 01:09:13,487 INFO [spark-akka.actor.default-dispatcher-15] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Got job 4 (take at <console>:24) with 1 output partitions (allowLocal=true)
2014-07-01 01:09:13,487 INFO [spark-akka.actor.default-dispatcher-15] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Final stage: Stage 4(take at <console>:24)
2014-07-01 01:09:13,487 INFO [spark-akka.actor.default-dispatcher-15] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Parents of final stage: List()
2014-07-01 01:09:13,488 INFO [spark-akka.actor.default-dispatcher-15] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Missing parents: List()
2014-07-01 01:09:13,488 INFO [spark-akka.actor.default-dispatcher-15] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Computing the requested partition locally
2014-07-01 01:09:13,488 INFO [Local computation of job 4] rdd.NewHadoopRDD (Logging.scala:logInfo(58)) - Input split: localhost:,
2014-07-01 01:09:13,504 INFO [main] spark.SparkContext (Logging.scala:logInfo(58)) - Job finished: take at <console>:24, took 0.018069267 s
res: Array[(org.apache.hadoop.hbase.io.ImmutableBytesWritable, org.apache.hadoop.hbase.client.Result)] = Array((4a 69 6d,keyvalues={Jim/course:art/1404142440676/Put/vlen=2/mvcc=0, Jim/course:math/1404142434405/Put/vlen=2/mvcc=0, Jim/grade:/1404142422653/Put/vlen=1/mvcc=0})) scala> res(0)
res33: (org.apache.hadoop.hbase.io.ImmutableBytesWritable, org.apache.hadoop.hbase.client.Result) = (4a 69 6d,keyvalues={Jim/course:art/1404142440676/Put/vlen=2/mvcc=0, Jim/course:math/1404142434405/Put/vlen=2/mvcc=0, Jim/grade:/1404142422653/Put/vlen=1/mvcc=0}) scala> res(0)._2
res34: org.apache.hadoop.hbase.client.Result = keyvalues={Jim/course:art/1404142440676/Put/vlen=2/mvcc=0, Jim/course:math/1404142434405/Put/vlen=2/mvcc=0, Jim/grade:/1404142422653/Put/vlen=1/mvcc=0} scala> val rs = res(0)._2
rs: org.apache.hadoop.hbase.client.Result = keyvalues={Jim/course:art/1404142440676/Put/vlen=2/mvcc=0, Jim/course:math/1404142434405/Put/vlen=2/mvcc=0, Jim/grade:/1404142422653/Put/vlen=1/mvcc=0} scala> rs.
asInstanceOf cellScanner containsColumn containsEmptyColumn containsNonEmptyColumn copyFrom
getColumn getColumnCells getColumnLatest getColumnLatestCell getExists getFamilyMap
getMap getNoVersionMap getRow getValue getValueAsByteBuffer isEmpty
isInstanceOf list listCells loadValue raw rawCells
setExists size toString value

遍历这条记录,取出每一个cell的值:

scala> val kv_array = rs.raw
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
kv_array: Array[org.apache.hadoop.hbase.KeyValue] = Array(Jim/course:art/1404142440676/Put/vlen=2/mvcc=0, Jim/course:math/1404142434405/Put/vlen=2/mvcc=0, Jim/grade:/1404142422653/Put/vlen=1/mvcc=0)

遍历记录

scala> for(keyvalue <- kv) println("rowkey:"+ new String(keyvalue.getRow)+ " cf:"+new String(keyvalue.getFamily()) + " column:" + new String(keyvalue.getQualifier) + " " + "value:"+new String(keyvalue.getValue()))
warning: there were 4 deprecation warning(s); re-run with -deprecation for details
rowkey:Jim cf:course column:art value:67
rowkey:Jim cf:course column:math value:77
rowkey:Jim cf:grade column: value:3

查询记录个数

scala> hrdd.count
2014-07-01 01:26:03,133 INFO [main] spark.SparkContext (Logging.scala:logInfo(58)) - Starting job: count at <console>:25
2014-07-01 01:26:03,134 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Got job 5 (count at <console>:25) with 1 output partitions (allowLocal=false)
2014-07-01 01:26:03,134 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Final stage: Stage 5(count at <console>:25)
2014-07-01 01:26:03,134 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Parents of final stage: List()
2014-07-01 01:26:03,135 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Missing parents: List()
2014-07-01 01:26:03,166 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Submitting Stage 5 (NewHadoopRDD[0] at newAPIHadoopRDD at <console>:22), which has no missing parents
2014-07-01 01:26:03,397 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Submitting 1 missing tasks from Stage 5 (NewHadoopRDD[0] at newAPIHadoopRDD at <console>:22)
2014-07-01 01:26:03,401 INFO [spark-akka.actor.default-dispatcher-16] scheduler.TaskSchedulerImpl (Logging.scala:logInfo(58)) - Adding task set 5.0 with 1 tasks
2014-07-01 01:26:03,427 INFO [spark-akka.actor.default-dispatcher-16] scheduler.FairSchedulableBuilder (Logging.scala:logInfo(58)) - Added task set TaskSet_5 tasks to pool default
2014-07-01 01:26:03,439 INFO [spark-akka.actor.default-dispatcher-5] scheduler.TaskSetManager (Logging.scala:logInfo(58)) - Starting task 5.0:0 as TID 0 on executor 0: 192.168.2.105 (PROCESS_LOCAL)
2014-07-01 01:26:03,469 INFO [spark-akka.actor.default-dispatcher-5] scheduler.TaskSetManager (Logging.scala:logInfo(58)) - Serialized task 5.0:0 as 1305 bytes in 7 ms
2014-07-01 01:26:11,015 INFO [Result resolver thread-0] scheduler.TaskSetManager (Logging.scala:logInfo(58)) - Finished TID 0 in 7568 ms on 192.168.2.105 (progress: 1/1)
2014-07-01 01:26:11,017 INFO [Result resolver thread-0] scheduler.TaskSchedulerImpl (Logging.scala:logInfo(58)) - Removed TaskSet 5.0, whose tasks have all completed, from pool default
2014-07-01 01:26:11,036 INFO [spark-akka.actor.default-dispatcher-4] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Completed ResultTask(5, 0)
2014-07-01 01:26:11,057 INFO [spark-akka.actor.default-dispatcher-4] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Stage 5 (count at <console>:25) finished in 7.605 s
2014-07-01 01:26:11,067 INFO [main] spark.SparkContext (Logging.scala:logInfo(58)) - Job finished: count at <console>:25, took 7.933270634 s
res71: Long = 3

版本号二、

hrdd.map(tuple => tuple._2).map(result => (result.getRow, result.getColumn("course".getBytes(), "art".getBytes()))).map(row => {
(
row._1.map(_.toChar).mkString,
row._2.asScala.reduceLeft {
(a, b) => if (a.getTimestamp > b.getTimestamp) a else b
}.getValue.map(_.toChar).mkString
)
}).take(10)

这样就能得到row key 和相应 column family的值了。

四、总结
Spark操作Hbase事实上和java client操作Hbas大体流程是一致的,都是客户端去连接HMaster,终于利用java api来操作hbase。

仅仅只是Spark提供了一种与RDD结合的概念,而且利用scala的语法简洁性。提高了编程效率。

——EOF——
原创文章,转载请注明来自:http://blog.csdn.net/oopsoom/article/details/36071323

Spark操作hbase的更多相关文章

  1. spark 操作hbase

    HBase经过七年发展,终于在今年2月底,发布了 1.0.0 版本.这个版本提供了一些让人激动的功能,并且,在不牺牲稳定性的前提下,引入了新的API.虽然 1.0.0 兼容旧版本的 API,不过还是应 ...

  2. Spark操作HBase问题:java.io.IOException: Non-increasing Bloom keys

    1 问题描述 在使用Spark BulkLoad数据到HBase时遇到以下问题: 17/05/19 14:47:26 WARN scheduler.TaskSetManager: Lost task ...

  3. Spark操作HBase报:org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException异常解决方案

    一.异常信息 19/03/21 15:01:52 WARN scheduler.TaskSetManager: Lost task 4.0 in stage 21.0 (TID 14640, hnte ...

  4. spark 对hbase 操作

    本文将分两部分介绍,第一部分讲解使用 HBase 新版 API 进行 CRUD 基本操作:第二部分讲解如何将 Spark 内的 RDDs 写入 HBase 的表中,反之,HBase 中的表又是如何以 ...

  5. Spark读取Hbase中的数据

    大家可能都知道很熟悉Spark的两种常见的数据读取方式(存放到RDD中):(1).调用parallelize函数直接从集合中获取数据,并存入RDD中:Java版本如下: JavaRDD<Inte ...

  6. spark(2.1.0) 操作hbase(1.0.2)

    一.写操作 1.spark中引入外部jar包 1)创建/usr/software/spark_jars目录,把hbase里的lib里的以下七个jar放入/usr/software/spark_jars ...

  7. Spark-读写HBase,SparkStreaming操作,Spark的HBase相关操作

    Spark-读写HBase,SparkStreaming操作,Spark的HBase相关操作 1.sparkstreaming实时写入Hbase(saveAsNewAPIHadoopDataset方法 ...

  8. Spark 下操作 HBase(1.0.0 新 API)

    hbase1.0.0版本提供了一些让人激动的功能,并且,在不牺牲稳定性的前提下,引入了新的API.虽然 1.0.0 兼容旧版本的 API,不过还是应该尽早地来熟悉下新版API.并且了解下如何与当下正红 ...

  9. PySpark操作HBase时设置scan参数

    在用PySpark操作HBase时默认是scan操作,通常情况下我们希望加上rowkey指定范围,即只获取一部分数据参加运算.翻遍了spark的python相关文档,搜遍了google和stackov ...

随机推荐

  1. 使用WPF创建无边框窗体

    一.无边框窗口添加窗口阴影 实际上在WPF中添加无边框窗口的窗口阴影十分简单. 首先,设置WindowStyle="None"以及AllowsTransparency=" ...

  2. 【Linux探索之旅】第二部分第三课:文件和目录,组织不会亏待你

    内容简介 1.第二部分第三课:文件和目录,组织不会亏待你 2.第二部分第四课预告:文件操纵,鼓掌之中 文件和目录,组织不会亏待你 上一次课我们讲了命令行,这将成为伴随我们接下来整个Linux课程的一个 ...

  3. MongoDB学习笔记-基础概念

    mongodb中基本的概念 文档.集合.数据库 与关系数据库的概念对比更容易理解

  4. 设计模式C++达到 1.辛格尔顿

    实现类的单个案件的Singleton模式.该系统有一个类只有一个实例,而本实施例是容易的外部访问.所以容易控制的实例的数量,并且节省系统资源. 单的情况下通常与一些非本地静态对象的使用,对于这些对象, ...

  5. 5月,专用程序猿的经典大作——APUE

    五一小长假刚刚过去,收回我们游走的心.開始你们的读书旅程吧! 本期特别推荐 经典UNIX著作最新版. 20多年来,这本书帮助几代程序猿写出强大.高性能.可靠的代码. 第3版依据当今主流系统进行更新,更 ...

  6. JMS and ActiveMQ first lesson(转)

    JMS and ActiveMQ first lesson -- jms基础概念和应用场景 2011-6-18 PM 9:30 主讲:kimmking <kimmking@163.com> ...

  7. 怎样配置git ssh连接,怎样在GitHub上加入协作开发人员,怎样配置gitignore和怎样在GitHub上删除资源库.

    **********1.在运行git push origin master指令时报例如以下错误: iluckysi@ILUCKYSI-PC /d/ilucky/message/code (master ...

  8. 文件搜索神器everything 你不知道的技巧总结

    everything这个软件用了很久,总结了一些大家可能没注意到的技巧,分享给大家 1.指定文件目录搜索示例: TDDOWNLOAD\ abc        在所有TDDOWNLOAD文件夹下搜索包含 ...

  9. OpenCV——Delaunay三角 [转载]

    从这个博客转载 http://blog.csdn.net/raby_gyl/article/details/17409717 请其它同学转载时注明原始文章的出处! Delaunay三角剖分是1934年 ...

  10. C++ STL简化了编程

     图1.STL和c++标准模板库 作为C++标准必不可少的一部分,STL应该是渗透在C++程序的角角落落里的. STL不是实验室里的宠儿.也不是程序猿桌上的摆设.她的激动人心并不是昙花一现.本教程旨在 ...