Reading HBase into an RDD with Spark, then storing it in Hive or analyzing it with Spark SQL

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// One row of the event log table; every column is read as a string.
case class EventLog(rowKey: String, api_v: String, app_id: String, c_time: String,
                    ch_id: String, city: String, province: String, country: String,
                    en: String, ip: String, net_t: String, pl: String,
                    s_time: String, user_id: String, uuid: String, ver: String)

object SparkReadHbase {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[2]")
      .appName("Spark Read HBase")
      .enableHiveSupport() // required if the job also reads Hive tables
      .getOrCreate()
    val sc = spark.sparkContext

    // ZooKeeper settings; ZooKeeper stores the HBase meta information
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "hadoop01,hadoop02,hadoop03")
    conf.set("hbase.zookeeper.property.clientPort", "2181") // default ZooKeeper client port
    conf.set(TableInputFormat.INPUT_TABLE, "event_logs_20190218")

    // Read the table and turn it into an RDD
    val hBaseRDD: RDD[(ImmutableBytesWritable, Result)] = sc.newAPIHadoopRDD(conf,
      classOf[TableInputFormat],       // input format
      classOf[ImmutableBytesWritable], // key type (the row key)
      classOf[Result])                 // value type (the row contents)

    val count = hBaseRDD.count()
    println("\n\n\n:" + count)

    import spark.implicits._

    val logRDD: RDD[EventLog] = hBaseRDD.map { case (_, result) =>
      // Reads one column of the "info" family as a string
      def info(qualifier: String): String =
        Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes(qualifier)))

      // Get the row key
      val rowKey = Bytes.toString(result.getRow)

      // The case class supplies the schema. Tuples are limited to 22 elements, and so were
      // case classes before Scala 2.11; for wider tables the usual approach is to build
      // Rows together with an explicit StructType schema.
      EventLog(rowKey, info("api_v"), info("app_id"), info("c_time"), info("ch_id"),
        info("city"), info("province"), info("country"), info("en"), info("ip"),
        info("net_t"), info("pl"), info("s_time"), info("user_id"), info("uuid"), info("ver"))
    }

    // The RDD can be converted to a DataFrame/Dataset and stored in Hive as a wide table,
    // or analyzed directly with Spark Core
    val logds = logRDD.toDS()
    logds.createTempView("event_logs")

    val sq = spark.sql("select * from event_logs limit 1")
    sq.explain() // prints the physical plan itself; it returns Unit
    sq.show()

    spark.stop() // also stops the underlying SparkContext
  }
}
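The comments in the read job mention two things the listing does not show: building the schema with a StructType instead of a case class, and storing the result in Hive. The following is a minimal sketch of both, not the author's code. It assumes the same column names as the read job; the object name `EventLogSchema`, the stand-in records, and the Hive table name `default.event_logs_wide` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object EventLogSchema {
  // Column names reused from the read job; every value is kept as a string.
  val columns: Seq[String] = Seq("row_key", "api_v", "app_id", "c_time", "ch_id", "city",
    "province", "country", "en", "ip", "net_t", "pl", "s_time", "user_id", "uuid", "ver")

  // Explicit schema: one nullable string field per column.
  val schema: StructType =
    StructType(columns.map(name => StructField(name, StringType, nullable = true)))

  // Turns one record (column name -> value) into a Row ordered like the schema;
  // columns missing from the record become null.
  def toRow(record: Map[String, String]): Row =
    Row.fromSeq(columns.map(c => record.getOrElse(c, null)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("StructType sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Stand-in data (hypothetical); in the read job each map would be built from an HBase Result.
    val records = Seq(Map("row_key" -> "r1", "city" -> "beijing", "pl" -> "android"))
    val rowRDD = spark.sparkContext.parallelize(records).map(toRow)

    // Combine the Rows with the schema, then store the result as a Hive table.
    val df: DataFrame = spark.createDataFrame(rowRDD, schema)
    df.write.mode(SaveMode.Overwrite).saveAsTable("default.event_logs_wide") // hypothetical table name

    spark.stop()
  }
}
```

Because the schema is derived from a plain list of column names, adding a column only means extending `columns`; no case class has to change.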

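// Write to HBase

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

/**
  * @created by imp ON 2018/2/19
  */
object SparkWriteHbase { // an object, so that main is a runnable entry point
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName(this.getClass.getName).setMaster("local")
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")

    // ZooKeeper settings and the target table
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "hadoop01,hadoop02,hadoop03")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    conf.set(TableOutputFormat.OUTPUT_TABLE, "test")

    // Job.getInstance replaces the deprecated Job constructor
    val job = Job.getInstance(conf)
    job.setOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setOutputValueClass(classOf[Put]) // the values written are Puts, not Results
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    // Each input line has the form "rowKey, total"
    val arrResult: Array[String] = Array("1, 3000000000")
    //arrResult(0) = "1,100,11"

    val resultRDD = sc.makeRDD(arrResult)
    // Trim each field so the space after the comma is not stored in HBase
    val saveRDD = resultRDD.map(_.split(',').map(_.trim)).map { arr =>
      val put = new Put(Bytes.toBytes(arr(0)))
      // addColumn replaces Put.add, which was deprecated in HBase 1.0
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("total"), Bytes.toBytes(arr(1)))
      (new ImmutableBytesWritable, put)
    }

    println("getConfiguration")
    val c = job.getConfiguration
    println("save")
    saveRDD.saveAsNewAPIHadoopDataset(c)

    sc.stop()
  }
}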
