DataFrame statistics visualization: a Spark/Scala application
Statistics output:
Code:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{Logging, SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SaveMode, _}
import com.alibaba.fastjson.{JSON, JSONObject}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.types.StringType
import scala.collection.mutable.ArrayBuffer
/**
* Purpose: compute column-level statistics for a Hive table.
* The statistics include:
* 1. Per-column mean, median, min, max, standard deviation, unique-value count, missing-value count, and column type.
* 2. Per-column histogram (top-10 values for string columns, 10 equal-width buckets for numeric columns) and quartile distribution (numeric columns only).
*
* Implementation:
* 1. Use Spark's describe() to obtain max, min, mean, stddev, etc.
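*    (For reference, an illustrative describe() result is sketched below; the column names and values are
*     hypothetical. It returns one row per statistic, keyed by the "summary" column, which getDesfromDF parses.)
*      +-------+-----+-----+
*      |summary|   id|  age|
*      +-------+-----+-----+
*      |  count|   14|   14|
*      |   mean|  7.5| 45.2|
*      | stddev|  4.2| 18.7|
*      |    min|    1| 29.0|
*      |    max|  100|100.0|
*      +-------+-----+-----+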
* 2. Use SQL to obtain the unique-value and missing-value counts per column, for example:
* select count(distinct(id)) as unique_id , count(distinct(name)) as unique_name, sum(case when id is null then 1 else 0 end) as missing_id, sum(case when name is null then 1 else 0 end) as missing_name, sum(1) as totalrows from zpcrcf
* Result:
* +---------+-----------+----------+------------+---------+
* |unique_id|unique_name|missing_id|missing_name|totalrows|
* +---------+-----------+----------+------------+---------+
* | 14| 12| 0| 0| 14|
* +---------+-----------+----------+------------+---------+
*
* 3. Use SQL to obtain the quartiles, for example:
* select 'Quartile_id' as colName, ntil, max(id) as num from (select id, ntile(4) OVER (order by id)as ntil from zpcrcf) tt group by ntil
* Result:
* +------------+----+---+
* | colName|ntil|num|
* +------------+----+---+
* | Quartile_id| 1| 3|
* | Quartile_id| 2| 7|
* | Quartile_id| 3| 14|
* | Quartile_id|   4|100|
* +------------+----+---+
*
* 4. For numeric columns, build a 10-bucket histogram: split the range (max - min) into 10 equal-width intervals and count the rows falling into each.
* Sample SQL:
* select 'MathHistogram_age' as colName, partNum, count(1) as num from ( select age, (case when (age >= 29.0 and age <= 36.1) then 1 when (age > 36.1 and age <= 43.2) then 2 when (age > 43.2 and age <= 50.3) then 3 when (age > 50.3 and age <= 57.4) then 4 when (age > 57.4 and age <= 64.5) then 5 when (age > 64.5 and age <= 71.6) then 6 when (age > 71.6 and age <= 78.69999999999999) then 7 when (age > 78.69999999999999 and age <= 85.8) then 8 when (age > 85.8 and age <= 92.9) then 9 when (age > 92.9 and age <= 100.0) then 10 else 0 end ) as partNum from zpcrcf) temptableScala group by partNum
* Result:
* +-----------------+-------+---+
* | colName|partNum|num|
* +-----------------+-------+---+
* |MathHistogram_age| 0| 1|
* |MathHistogram_age| 1| 3|
* |MathHistogram_age| 10| 10|
* | MathHistogram_id| 1| 10|
* | MathHistogram_id| 2| 3|
* | MathHistogram_id| 10| 1|
* +-----------------+-------+---+
*
*
*
* Created by zpc on 2016/4/26.
*/
object DataFrameVisiualize extends Logging {

def runforstatistic(hiveContext: HiveContext, params: JSONObject) = {
val arr = params.getJSONArray("targetType")
var i = 0
while( arr != null && i < arr.size()){
val obj = arr.getJSONObject(i)
if("dataset".equalsIgnoreCase(obj.getString("targetType"))){
val tableNameKey = obj.getString("targetName")
val tableName = params.getString(tableNameKey)
val user = params.getString("user")
run(hiveContext, tableName, user)
}
i = i+1
}
}

def run(hiveContext: HiveContext, tableName: String, user: String) = {
val pathParent = s"/user/$user/mlaas/tableStatistic/$tableName"
// val conf = new SparkConf().setAppName("DataFrameVisiualizeJob")
// val sc = new SparkContext(conf)
// val hiveContext = new HiveContext(sc)
// val sqlContext = new SQLContext(sc)
// 0. Get the table schema
val schemadf = hiveContext.sql("desc " + tableName)
// Persist the schema as JSON
val filePathSchema = pathParent + "/schemajson"
schemadf.write.mode(SaveMode.Overwrite).format("json").save(filePathSchema)
// 1. Load the table into a DataFrame
val df = hiveContext.sql("select * from " + tableName)
// 2. Get the DataFrame's describe() output; all of its columns (other than "summary") are treated as numeric
val dfdesc = df.describe()
// // 3. Persist the describe output (kept commented out for reference)
// val filePath = pathParent + "/describejson"
// des.write.mode(SaveMode.Overwrite).format("json").save(filePath)
// val dfdesc = sqlContext.read.format("json").load(filePath)
// 4. Split the columns into mathColArr (numeric) and strColArr (string)
val mathColArr = dfdesc.columns.filter(!_.equalsIgnoreCase("summary"))
val (colMin, colMax, colMean, colStddev, colMedian) = getDesfromDF(dfdesc, mathColArr)
val allColArr = df.columns
// Column types may also include vector; only string and numeric columns are profiled here.
val typeMap = df.dtypes.toMap
val strColArr = allColArr.filter(typeMap.get(_).get.equals(StringType.toString))
// val strColArr = allColArr.filter(!_.equalsIgnoreCase("summary")).diff(mathColArr)
saveRecords(hiveContext, tableName, 100, pathParent + "/recordsjson")
val jsonobj = getAllStatistics(hiveContext, tableName, allColArr, strColArr, mathColArr, 10, colMin, colMax)
jsonobj.put("colMin", colMin)
jsonobj.put("colMax", colMax)
jsonobj.put("colMean", colMean)
jsonobj.put("colStddev", colStddev)
jsonobj.put("colMedian", colMedian) val jsonStr = jsonobj.toString
val conf1 = new Configuration()
val fs = FileSystem.get(conf1)
val fileName = pathParent + "/jsonObj"
val path = new Path(fileName)
val hdfsOutStream = fs.create(path)
hdfsOutStream.write(jsonStr.getBytes("utf-8"))
hdfsOutStream.flush()
hdfsOutStream.close()
// fs.close()
}

def saveRecords(hiveContext: HiveContext, tableName: String, num: Int, filePath: String): Unit = {
hiveContext.sql(s"select * from $tableName limit $num").write.mode(SaveMode.Overwrite).format("json").save(filePath)
}
/**
* Given the allColArr, mathColArr and strColArr arrays, build DataFrames carrying all remaining statistics
* (excluding those already obtained from describe()), then iterate over the collected results and fill in each field.
*/
def getAllStatistics(hiveContext: HiveContext, tableName: String, allColArr: Array[String], strColArr: Array[String], mathColArr: Array[String], partNum: Int, colMin: java.util.HashMap[String, Double], colMax: java.util.HashMap[String, Double]) :
JSONObject = {
val jsonobj = new JSONObject()
val sb = new StringBuffer()
sb.append("select ")
allColArr.map{col => sb.append(s"count(distinct(`$col`)) as unique_$col ," +
s"sum(case when `$col` is null then 1 else 0 end) as missing_$col, ")}
sb.append(s"sum(1) as totalrows from $tableName")
val df = hiveContext.sql(sb.toString)
val colUnique = new java.util.HashMap[String, Long] // unique-value count per column
val colMissing = new java.util.HashMap[String, Long] // missing-value (null) count per column
var totalrows = 0L
df.take(1).foreach(row => {
totalrows = row.getAs[Long]("totalrows")
jsonobj.put("totalrows", totalrows)
allColArr.foreach(col => {
colUnique.put(col, row.getAs[Long]("unique_" + col))
colMissing.put(col, row.getAs[Long]("missing_" + col))
})
})
val dfArr = ArrayBuffer[DataFrame]()
val strHistogramSql = new StringBuffer()
strHistogramSql.append(s"""
SELECT tta.colName, tta.value, tta.num
FROM (
SELECT ta.colName, ta.value, ta.num, ROW_NUMBER() OVER (PARTITION BY ta.colName ORDER BY ta.num DESC) AS row
FROM (
""") var vergin = 0
for(col <- strColArr){
if(vergin == 1){
strHistogramSql.append(" UNION ALL ")
}
vergin = 1
strHistogramSql.append(s"""
SELECT 'StrHistogram_$col' AS colName, `$col` AS value, COUNT(1) AS num
FROM $tableName
GROUP BY `$col` """)
}
strHistogramSql.append(s"""
) ta
) tta
WHERE tta.row <= $partNum
""")
// The table may contain no string columns at all; in that case the SQL above is incomplete and executing it would throw an error.
if(strColArr != null && strColArr.size != 0 ){
val dfStrHistogram = hiveContext.sql(strHistogramSql.toString)
dfArr.append(dfStrHistogram)
}

for (col <- mathColArr) {
val df1 = hiveContext.sql(s"select 'Quartile_$col' as colName, ntil, bigint(max(`$col`)) as num from (select `$col`, ntile(4) OVER (order by `$col`)as ntil from $tableName) tt group by ntil ")
log.info("col is :" + col + ", min is :" + colMin.get(col) + ", max is : " + colMax.get(col))
// when the column data contains null, the min and max may be null or be "Infinity".
if (colMin == null || colMin.get(col) == null || colMax.get(col) == null || colMax.get(col) == "Infinity" || colMin.get(col) == "Infinity") {
log.info("col is :" + col + ", min is :" + colMin.get(col) + ", max is : " + colMax.get(col))
} else {
// Need toString first, then toDouble; otherwise a ClassCastException is thrown.
val min = colMin.get(col).toString.toDouble
val max = colMax.get(col).toString.toDouble
val df2 = getHistogramMathDF(col, hiveContext, tableName, min, max, partNum)
dfArr.append(df1)
dfArr.append(df2)
}
}
// There may be no columns to profile at all, e.g. every column is double but all the data is null.
// In that case dfArr.reduce would throw: java.lang.UnsupportedOperationException: empty.reduceLeft
// When the total row count is 0, the quartiles and histograms cannot be computed either, and a NullPointerException would occur.
if(dfArr.isEmpty || totalrows == 0L){
jsonobj.put("colUnique", colUnique)
jsonobj.put("colMissing", colMissing)
}else {
val dfAll = dfArr.reduce(_.unionAll(_))
val allRows = dfAll.collect()
val mathColMapQuartile = new java.util.HashMap[String, Array[java.util.HashMap[String, Long]]] // quartiles
val mathColMapHistogram = new java.util.HashMap[String, Array[java.util.HashMap[String, Long]]] // histogram (numeric)
val strColMapHistogram = new java.util.HashMap[String, Array[java.util.HashMap[String, Long]]] // histogram (string)
val (mathColMapQuartile1, mathColMapHistogram1, strColMapHistogram1) = readRows(allRows)
for (col <- strColArr) {
strColMapHistogram.put(col, strColMapHistogram1.get(col).toArray[java.util.HashMap[String, Long]])
}
for (col <- mathColArr) {
mathColMapQuartile.put(col, mathColMapQuartile1.get(col).toArray[java.util.HashMap[String, Long]])
mathColMapHistogram.put(col, mathColMapHistogram1.get(col).toArray[java.util.HashMap[String, Long]])
}
jsonobj.put("mathColMapQuartile", mathColMapQuartile)
jsonobj.put("mathColMapHistogram", mathColMapHistogram)
jsonobj.put("strColMapHistogram", strColMapHistogram)
jsonobj.put("colUnique", colUnique)
jsonobj.put("colMissing", colMissing)
}
jsonobj
}
def readRows(rows: Array[Row]) : (java.util.HashMap[String, ArrayBuffer[java.util.HashMap[String,Long]]] , java.util.HashMap[String, ArrayBuffer[java.util.HashMap[String,Long]]], java.util.HashMap[String, ArrayBuffer[java.util.HashMap[String,Long]]])={
val mathColMapQuartile = new java.util.HashMap[String, ArrayBuffer[java.util.HashMap[String, Long]]] // quartiles
val mathColMapHistogram = new java.util.HashMap[String, ArrayBuffer[java.util.HashMap[String, Long]]] // histogram (numeric)
val strColMapHistogram = new java.util.HashMap[String, ArrayBuffer[java.util.HashMap[String, Long]]] // histogram (string)
rows.foreach( row => {
val colName = row.getAs[String]("colName")
if (colName.startsWith("StrHistogram")) {
val value = row.getAs[String](1)
val num = row.getAs[Long](2)
val map = new java.util.HashMap[String, Long]()
val col = colName.substring(colName.indexOf('_') + 1)
map.put(value, num)
val mapValue = strColMapHistogram.get(col)
if (mapValue == null) {
val mapValueNew = ArrayBuffer[java.util.HashMap[String, Long]]()
mapValueNew.append(map)
strColMapHistogram.put(col, mapValueNew)
} else {
mapValue.append(map)
strColMapHistogram.put(col, mapValue)
}
} else if (colName.toString.startsWith("Quartile")) {
val value = row.get(1).toString
val num = row.getAs[Long](2)
val map = new java.util.HashMap[String, Long]()
val col = colName.substring(colName.indexOf('_') + 1)
map.put(value, num)
val mapValue = mathColMapQuartile.get(col)
if (mapValue == null) {
val mapValueNew = ArrayBuffer[java.util.HashMap[String, Long]]()
mapValueNew.append(map)
mathColMapQuartile.put(col, mapValueNew)
} else {
mapValue.append(map)
mathColMapQuartile.put(col, mapValue)
}
} else if (colName.toString.startsWith("MathHistogram")) {
val value =row.get(1).toString
val num = row.getAs[Long](2)
val map = new java.util.HashMap[String, Long]()
val col = colName.substring(colName.indexOf('_') + 1)
map.put(value, num)
val mapValue = mathColMapHistogram.get(col)
if (mapValue == null) {
val mapValueNew = ArrayBuffer[java.util.HashMap[String, Long]]()
mapValueNew.append(map)
mathColMapHistogram.put(col, mapValueNew)
} else {
mapValue.append(map)
mathColMapHistogram.put(col, mapValue)
}
}
})
(mathColMapQuartile, mathColMapHistogram, strColMapHistogram)
}
/** Build the histogram (bucketed distribution) DataFrame for a numeric column. */
def getHistogramMathDF(col : String, hiveContext: HiveContext, tableName: String, min: Double, max: Double, partNum: Int) : DataFrame = {
val len = (max - min) / partNum
log.info(s"len is : $len")
val sb = new StringBuffer()
sb.append(s"select `$col`, (case ")
val firstRight = min + len
sb.append(s" when (`$col` >= $min and `$col` <= $firstRight) then 1 ")
for (i <- 2 until (partNum + 1)) {
val left = min + len * (i - 1)
val right = min + len * i
sb.append(s" when (`$col` > $left and `$col` <= $right) then $i ")
}
sb.append(s" else 0 end ) as partNum from $tableName")
sb.insert(0, s"select 'MathHistogram_$col' as colName, partNum, count(1) as num from ( ")
sb.append(") temptableScala group by partNum")
log.info("getHistogram is: " + sb.toString)
val df = hiveContext.sql(sb.toString)
df
}
def getDesfromDF(dfdesc : DataFrame, mathColArr: Array[String]):
(java.util.HashMap[String, Double], java.util.HashMap[String, Double], java.util.HashMap[String, Double], java.util.HashMap[String, Double], java.util.HashMap[String, Double])= {
val allRows = dfdesc.collect()
// Note: describe() returns its statistics as string-typed columns, so the values stored below may actually
// be strings at runtime (generic erasure); callers convert with toString.toDouble (see getAllStatistics).
val colMin = new java.util.HashMap[String, Double] // minimum
val colMax = new java.util.HashMap[String, Double] // maximum
val colMean = new java.util.HashMap[String, Double] // mean
val colStddev = new java.util.HashMap[String, Double] // standard deviation
val colMedian = new java.util.HashMap[String, Double] // median
allRows.foreach(row => {
val mapKey = row.getAs[String]("summary")
for(col <- mathColArr){
if("mean".equalsIgnoreCase(mapKey)){
colMean.put(col, row.getAs[Double](col))
}else if("stddev".equalsIgnoreCase(mapKey)){
colStddev.put(col, row.getAs[Double](col))
}else if("min".equalsIgnoreCase(mapKey)){
log.info("col is " + col +", min is : "+ row.getAs[Double](col))
colMin.put(col, row.getAs[Double](col))
}else if("max".equalsIgnoreCase(mapKey)){
log.info("col is " + col +", max is : "+ row.getAs[Double](col))
colMax.put(col, row.getAs[Double](col))
}else{
// Note: stock describe() emits a "count" row (and no median row), so that row falls through to this branch.
colMedian.put(col, row.getAs[Double](col))
}
}
})
(colMin, colMax, colMean, colStddev, colMedian)
}
}
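
For completeness, here is a minimal driver sketch showing how the job above might be invoked. It is illustrative only: it assumes a Spark 1.x deployment with Hive support, and the object name DataFrameVisiualizeDriver, the Hive table zpcrcf, the key inputTable and the user zpc are all hypothetical. Only the shape of the params JSON (a "targetType" array whose "dataset" entries point at a table-name key, plus "user" and table-name entries) follows what runforstatistic reads.

import com.alibaba.fastjson.{JSONArray, JSONObject}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object DataFrameVisiualizeDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DataFrameVisiualizeJob")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    // Build the params object the job expects (all concrete values here are hypothetical).
    val target = new JSONObject()
    target.put("targetType", "dataset")    // only "dataset" entries are processed
    target.put("targetName", "inputTable") // the key under which the actual table name is stored
    val targets = new JSONArray()
    targets.add(target)

    val params = new JSONObject()
    params.put("targetType", targets)
    params.put("inputTable", "zpcrcf")     // hypothetical Hive table
    params.put("user", "zpc")              // results land under /user/zpc/mlaas/tableStatistic/zpcrcf

    DataFrameVisiualize.runforstatistic(hiveContext, params)
    sc.stop()
  }
}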