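Pitfalls encountered with a custom Spark SQL data source (HBase)
When writing a custom Spark SQL data source, the schema of the Spark SQL table has to be reconciled with the schema of the HBase table.
In Spark, a custom data source can be built by implementing these three interfaces:
BaseRelation: represents an abstract data source made up of rows with a known schema (a relational table).
TableScan: scans the whole table and returns the data as an RDD[Row].
RelationProvider: as the name suggests, returns a data source (BaseRelation) built from the parameters supplied by the user.
TableScan is the coarsest-grained query, since it scans the entire table in one pass. If you need finer-grained access that filters data out at the source, you can implement:
PrunedScan: column pruning
PrunedFilteredScan: column pruning + filtering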
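As a rough illustration of column pruning, here is a minimal sketch of a relation that implements PrunedScan over a small in-memory data set. The class name `InMemoryPrunedRelation` and its sample records are made up for this example and are not part of the HBase source discussed in this post; it needs Spark SQL on the classpath.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, PrunedScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical relation used only to illustrate PrunedScan.
class InMemoryPrunedRelation(val sqlContext: SQLContext)
  extends BaseRelation with PrunedScan {

  // Sample records keyed by column name (illustrative data).
  private val records: Seq[Map[String, Any]] = Seq(
    Map("id" -> 1, "name" -> "a"),
    Map("id" -> 2, "name" -> "b")
  )

  override def schema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = false)
  ))

  // Spark passes in only the columns the query needs, in the order it expects them.
  override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
    val rows = records // local copy so the closure does not capture `this`
    sqlContext.sparkContext.parallelize(rows).map { record =>
      Row.fromSeq(requiredColumns.map(record).toSeq)
    }
  }
}
```

With a relation like this, a query that selects only `name` should receive `requiredColumns = Array("name")`, so each Row carries a single value instead of the whole record.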
So, to connect to HBase, we define an HBase relation:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types._

class DefaultSource extends RelationProvider {
  def createRelation(sqlContext: SQLContext, parameters: Map[String, String]) = {
    HBaseRelation(parameters)(sqlContext)
  }
}

case class HBaseRelation(@transient val hbaseProps: Map[String, String])(@transient val sqlContext: SQLContext)
  extends BaseRelation with Serializable with TableScan {

  val hbaseTableName = hbaseProps.getOrElse("hbase_table_name", sys.error("not valid schema"))
  val hbaseTableSchema = hbaseProps.getOrElse("hbase_table_schema", sys.error("not valid schema"))
  val registerTableSchema = hbaseProps.getOrElse("sparksql_table_schema", sys.error("not valid schema"))
  val rowRange = hbaseProps.getOrElse("row_range", "->")
  // get start row and end row
  val range = rowRange.split("->", -1)
  val startRowKey = range(0).trim
  val endRowKey = range(1).trim
  val tempHBaseFields = extractHBaseSchema(hbaseTableSchema) // do not use this, a temp field
  val registerTableFields = extractRegisterSchema(registerTableSchema)
  val tempFieldRelation = tableSchemaFieldMapping(tempHBaseFields, registerTableFields)
  val hbaseTableFields = feedTypes(tempFieldRelation)
  val fieldsRelations = tableSchemaFieldMapping(hbaseTableFields, registerTableFields)
  val queryColumns = getQueryTargetCloumns(hbaseTableFields)

  // Copy the field types from the registered Spark SQL schema onto the HBase fields.
  def feedTypes(mapping: Map[HBaseSchemaField, RegisteredSchemaField]): Array[HBaseSchemaField] = {
    val hbaseFields = mapping.map {
      case (k, v) => k.copy(fieldType = v.fieldType)
    }
    hbaseFields.toArray
  }

  // The row key is declared as ":key" (empty column family, column name "key").
  def isRowKey(field: HBaseSchemaField): Boolean = {
    val cfColArray = field.fieldName.split(":", -1)
    val cfName = cfColArray(0)
    val colName = cfColArray(1)
    cfName == "" && colName == "key"
  }

  // Space-separated list of the columns to scan (everything except the row key).
  def getQueryTargetCloumns(hbaseTableFields: Array[HBaseSchemaField]): String = {
    val columns = hbaseTableFields.filterNot(isRowKey).map(_.fieldName).mkString(" ")
    println(columns)
    columns
  }

  lazy val schema = {
    val fields = hbaseTableFields.map { field =>
      val name = fieldsRelations.getOrElse(field, sys.error("table schema is not match the definition.")).fieldName
      val relatedType = field.fieldType match {
        case "String" => SchemaType(StringType, nullable = false)
        case "Int" => SchemaType(IntegerType, nullable = false)
        case "Long" => SchemaType(LongType, nullable = false)
        case "Float" => SchemaType(FloatType, nullable = false)
        case "Double" => SchemaType(DoubleType, nullable = false)
      }
      StructField(name, relatedType.dataType, relatedType.nullable)
    }
    StructType(fields)
  }

  // Pair each HBase field with the registered field at the same position.
  def tableSchemaFieldMapping(externalHBaseTable: Array[HBaseSchemaField], registerTable: Array[RegisteredSchemaField]): Map[HBaseSchemaField, RegisteredSchemaField] = {
    if (externalHBaseTable.length != registerTable.length) sys.error("columns size not match in definition!")
    val rs: Array[(HBaseSchemaField, RegisteredSchemaField)] = externalHBaseTable.zip(registerTable)
    rs.toMap
  }

  /**
   * The Spark SQL schema to register, e.g.
   * registerTableSchema '(rowkey String, value String, column_a String)'
   * Type names are matched literally ("String", "Int", "Long", "Float", "Double").
   */
  def extractRegisterSchema(registerTableSchema: String): Array[RegisteredSchemaField] = {
    val fieldsStr = registerTableSchema.trim.drop(1).dropRight(1)
    val fieldsArray = fieldsStr.split(",").map(_.trim)
    fieldsArray.map { fieldString =>
      val splitField = fieldString.split("\\s+", -1)
      RegisteredSchemaField(splitField(0), splitField(1))
    }
  }

  def extractHBaseSchema(externalTableSchema: String): Array[HBaseSchemaField] = {
    val fieldsStr = externalTableSchema.trim.drop(1).dropRight(1)
    val fieldsArray = fieldsStr.split(",").map(_.trim)
    fieldsArray.map(fieldString => HBaseSchemaField(fieldString, ""))
  }

  // By making this a lazy val we keep the RDD around, amortizing the cost of locating splits.
  lazy val buildScan = {
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", GlobalConfigUtils.hbaseQuorem)
    hbaseConf.set(TableInputFormat.INPUT_TABLE, hbaseTableName)
    hbaseConf.set(TableInputFormat.SCAN_COLUMNS, queryColumns)
    hbaseConf.set(TableInputFormat.SCAN_ROW_START, startRowKey)
    hbaseConf.set(TableInputFormat.SCAN_ROW_STOP, endRowKey)

    val hbaseRdd = sqlContext.sparkContext.newAPIHadoopRDD(
      hbaseConf,
      classOf[org.apache.hadoop.hbase.mapreduce.TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result]
    )

    // Turn every HBase Result into a Row, one value per declared field, in declaration order.
    hbaseRdd.map(_._2).map { result =>
      Row.fromSeq(hbaseTableFields.map(field => Resolver.resolve(field, result)).toSeq)
    }
  }

  private case class SchemaType(dataType: DataType, nullable: Boolean)
}
The HBase schema:
import org.apache.spark.sql.SQLContext

package object hbase {
  abstract class SchemaField extends Serializable
  case class RegisteredSchemaField(fieldName: String, fieldType: String) extends SchemaField with Serializable
  case class HBaseSchemaField(fieldName: String, fieldType: String) extends SchemaField with Serializable
  case class Parameter(name: String)

  // option keys understood by HBaseRelation
  protected val SPARK_SQL_TABLE_SCHEMA = Parameter("sparksql_table_schema")
  protected val HBASE_TABLE_NAME = Parameter("hbase_table_name")
  protected val HBASE_TABLE_SCHEMA = Parameter("hbase_table_schema")
  protected val ROW_RANGE = Parameter("row_range")

  /**
   * Adds a method, `hbaseTable`, to SQLContext that allows reading data stored in an HBase table.
   */
  implicit class HBaseContext(sqlContext: SQLContext) {
    def hbaseTable(sparksqlTableSchema: String, hbaseTableName: String, hbaseTableSchema: String, rowRange: String = "->") = {
      val params = Map(
        SPARK_SQL_TABLE_SCHEMA.name -> sparksqlTableSchema,
        HBASE_TABLE_NAME.name -> hbaseTableName,
        HBASE_TABLE_SCHEMA.name -> hbaseTableSchema,
        // start row and end row
        ROW_RANGE.name -> rowRange
      )
      sqlContext.baseRelationToDataFrame(HBaseRelation(params)(sqlContext))
    }
  }
}
Of course, the data types in the schema also need to be handled:
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes

object Resolver extends Serializable {
  def resolve(hbaseField: HBaseSchemaField, result: Result): Any = {
    val cfColArray = hbaseField.fieldName.split(":", -1)
    val cfName = cfColArray(0)
    val colName = cfColArray(1)
    // resolve the row key, otherwise resolve a regular column
    if (cfName == "" && colName == "key") resolveRowKey(result, hbaseField.fieldType)
    else resolveColumn(result, cfName, colName, hbaseField.fieldType)
  }

  // The row key is read as text and then parsed into the requested type.
  def resolveRowKey(result: Result, resultType: String): Any = {
    val rowKey = result.getRow.map(_.toChar).mkString
    resultType match {
      case "String" => rowKey
      case "Int" => rowKey.toInt
      case "Long" => rowKey.toLong
      case "Float" => rowKey.toFloat
      case "Double" => rowKey.toDouble
    }
  }

  // Column values are decoded from HBase's binary encoding; missing columns get a default value.
  def resolveColumn(result: Result, columnFamily: String, columnName: String, resultType: String): Any = {
    val family = columnFamily.getBytes
    val qualifier = columnName.getBytes
    if (result.containsColumn(family, qualifier)) {
      val value = result.getValue(family, qualifier)
      resultType match {
        case "String" => Bytes.toString(value)
        case "Int" => Bytes.toInt(value)
        case "Long" => Bytes.toLong(value)
        case "Float" => Bytes.toFloat(value)
        case "Double" => Bytes.toDouble(value)
      }
    } else {
      resultType match {
        case "String" => ""
        case "Int" => 0
        case "Long" => 0L
        case "Float" => 0.0f
        case "Double" => 0.0
      }
    }
  }
}
A quick test:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object CustomHbaseTest {
  def main(args: Array[String]): Unit = {
    val startTime = System.currentTimeMillis()
    val sparkConf: SparkConf = new SparkConf()
      .setMaster("local[6]")
      .setAppName("query")
      .set("spark.worker.timeout", GlobalConfigUtils.sparkWorkTimeout)
      .set("spark.cores.max", GlobalConfigUtils.sparkMaxCores)
      .set("spark.rpc.askTimeout", GlobalConfigUtils.sparkRpcTimeout)
      .set("spark.task.maxFailures", GlobalConfigUtils.sparkTaskMaxFailures)
      .set("spark.speculation", GlobalConfigUtils.sparkSpeculation)
      .set("spark.driver.allowMultipleContexts", GlobalConfigUtils.sparkAllowMutilpleContext)
      .set("spark.serializer", GlobalConfigUtils.sparkSerializer)
      .set("spark.buffer.pageSize", GlobalConfigUtils.sparkBuferSize)
      // overrides the spark.serializer value set above
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.driver.host", "localhost")
    val sparkSession: SparkSession = SparkSession.builder()
      .config(sparkConf)
      .enableHiveSupport() // enable Hive support
      .getOrCreate()
    val hbasetable = sparkSession
      .read
      .format("com.df.test_custom.customSource")
      .options(
        Map(
          "sparksql_table_schema" -> "(id String, create_time String , open_lng String , open_lat String , begin_address_code String , charge_mileage String , city_name String , vehicle_license String)",
          "hbase_table_name" -> "order_info",
          "hbase_table_schema" -> "(MM:id , MM:create_time , MM:open_lng , MM:open_lat , MM:begin_address_code , MM:charge_mileage , MM:city_name , MM:vehicle_license)"
        ))
      .load()
    hbasetable.createOrReplaceTempView("orderData")
    sparkSession.sql(
      """
        |select * from orderData
      """.stripMargin).show()
    val endTime = System.currentTimeMillis()
    println(s"Elapsed time: ${endTime - startTime} ms")
  }
}
With all the code in place the job ran, but the queried data did not line up with the expected columns.
For example:
val hbasetable = sparkSession
  .read
  .format("com.df.test_custom.customSource")
  .options(
    Map(
      "sparksql_table_schema" -> "(id String, create_time String , open_lng String , open_lat String , begin_address_code String , charge_mileage String , city_name String , vehicle_license String)",
      "hbase_table_name" -> "order_info",
      "hbase_table_schema" -> "(MM:id , MM:create_time , MM:open_lng , MM:open_lat , MM:begin_address_code , MM:charge_mileage , MM:city_name , MM:vehicle_license)"
    ))
  .load()
The Spark SQL table schema and the HBase schema are specified as in the code above.
But the query returned data like this:
hbasetable.createOrReplaceTempView("orderData")
sparkSession.sql(
  """
    |select * from orderData
  """.stripMargin).show()
In the output (a screenshot in the original post), many of the columns are clearly out of order!
The cause of the problem:
def tableSchemaFieldMapping(externalHBaseTable: Array[HBaseSchemaField], registerTable: Array[RegisteredSchemaField]): Map[HBaseSchemaField, RegisteredSchemaField] = {
  if (externalHBaseTable.length != registerTable.length) sys.error("columns size not match in definition!")
  val rs: Array[(HBaseSchemaField, RegisteredSchemaField)] = externalHBaseTable.zip(registerTable)
  rs.toMap
}
As you can see, the last step is rs.toMap.
Note that the Map produced here (Scala's default immutable Map) does not guarantee iteration order. For example:
object TestMap {
  def main(args: Array[String]): Unit = {
    val arr1 = Array("java", "scala", "javascript", "ii", "wqe", "qaz")
    val arr2 = Array("java", "scala", "javascript", "ii", "wqe", "qaz")
    val toMap: Map[String, String] = arr1.zip(arr2).toMap
    for ((k, v) <- toMap) {
      println(s"k :${k} , v:${v}")
    }
  }
}
The result looks like this (shown as a screenshot in the original post):
Clearly the entries do not come out in the order produced by zip; the problem lies in toMap.
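One nuance explains why this kind of bug can stay hidden in small tests. The sketch below assumes the standard Scala 2.x library, where immutable maps with up to four entries use dedicated small-map classes that happen to iterate in insertion order, while larger maps fall back to a hash-based implementation. This is an implementation detail rather than a documented guarantee, and `MapOrderDemo` is just an illustrative name.

```scala
object MapOrderDemo {
  // Same sample keys as in the TestMap example, paired with themselves.
  val pairs: Seq[(String, String)] =
    Seq("java", "scala", "javascript", "ii", "wqe", "qaz").map(k => k -> k)

  // Four entries: in practice these come back in insertion order.
  val small: Map[String, String] = pairs.take(4).toMap

  // Six entries: hash-based, so the iteration order depends on the key hashes.
  val large: Map[String, String] = pairs.toMap

  def main(args: Array[String]): Unit = {
    println(small.keys.mkString(", "))
    println(large.keys.mkString(", "))
  }
}
```

A table with four columns or fewer would therefore probably not have shown the problem, while the eight-column table in this post does.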
Solution:
Since JDK 1.4, Java has shipped a Map that preserves insertion order: LinkedHashMap.
So the fix is to replace the Scala Map with a LinkedHashMap.
1) Modify feedTypes
// Needs `import java.util` plus an implicit Java-to-Scala map conversion in scope
// (e.g. `import scala.collection.JavaConversions._` on Scala 2.11/2.12);
// without it, `mapping.map` on a java.util.LinkedHashMap does not compile.
def feedTypes(mapping: util.LinkedHashMap[HBaseSchemaField, RegisteredSchemaField]): Array[HBaseSchemaField] = {
  val hbaseFields = mapping.map {
    case (k, v) => k.copy(fieldType = v.fieldType)
  }
  hbaseFields.toArray
}
// Previous version:
// def feedTypes(mapping: Map[HBaseSchemaField, RegisteredSchemaField]): Array[HBaseSchemaField] = {
//   val hbaseFields = mapping.map {
//     case (k, v) => k.copy(fieldType = v.fieldType)
//   }
//   hbaseFields.toArray
// }
2) Modify tableSchemaFieldMapping
def tableSchemaFieldMapping(externalHBaseTable: Array[HBaseSchemaField], registerTable: Array[RegisteredSchemaField]): util.LinkedHashMap[HBaseSchemaField, RegisteredSchemaField] = {
  if (externalHBaseTable.length != registerTable.length) sys.error("columns size not match in definition!")
  val rs: Array[(HBaseSchemaField, RegisteredSchemaField)] = externalHBaseTable.zip(registerTable)
  // insert the pairs one by one so that the zip order is preserved
  val linkedHashMap = new util.LinkedHashMap[HBaseSchemaField, RegisteredSchemaField]()
  for (arr <- rs) {
    linkedHashMap.put(arr._1, arr._2)
  }
  linkedHashMap
}
// Previous version:
// def tableSchemaFieldMapping(externalHBaseTable: Array[HBaseSchemaField], registerTable: Array[RegisteredSchemaField]): Map[HBaseSchemaField, RegisteredSchemaField] = {
//   if (externalHBaseTable.length != registerTable.length) sys.error("columns size not match in definition!")
//   val rs: Array[(HBaseSchemaField, RegisteredSchemaField)] = externalHBaseTable.zip(registerTable)
//   rs.toMap
// }
Then rerun the test code. The result:
It works!!!
PS: you can copy my code and use it as is.
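If you would rather stay within the Scala collections, `scala.collection.mutable.LinkedHashMap` also iterates in insertion order. The following is only a sketch of that alternative, not the code used in this post: `HField` and `RField` are simplified stand-ins for HBaseSchemaField and RegisteredSchemaField, and `OrderedFieldMapping` is a made-up name.

```scala
import scala.collection.mutable

object OrderedFieldMapping {
  // Simplified stand-ins for the post's schema field classes (hypothetical names).
  final case class HField(fieldName: String, fieldType: String)
  final case class RField(fieldName: String, fieldType: String)

  // Pair fields by position and keep that order in an insertion-ordered map.
  def mapping(hbase: Array[HField], registered: Array[RField]): mutable.LinkedHashMap[HField, RField] = {
    require(hbase.length == registered.length, "columns size not match in definition!")
    mutable.LinkedHashMap(hbase.zip(registered): _*)
  }

  // Copy the registered types onto the HBase fields, walking the map in insertion order.
  def feedTypes(m: mutable.LinkedHashMap[HField, RField]): Array[HField] =
    m.iterator.map { case (k, v) => k.copy(fieldType = v.fieldType) }.toArray
}
```

Because the mapping is only ever built from two arrays zipped by position, keeping the zipped array of pairs itself would be another option; the map is only needed for the lookup by field.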
One more thing:
val hbasetable = sparkSession
  .read
  .format("com.df.test_custom.customSource")
  .options(
    Map(
      "sparksql_table_schema" -> "(id String, create_time String , open_lng String , open_lat String , begin_address_code String , charge_mileage String , city_name String , vehicle_license String)",
      "hbase_table_name" -> "order_info",
      "hbase_table_schema" -> "(MM:id , MM:create_time , MM:open_lng , MM:open_lat , MM:begin_address_code , MM:charge_mileage , MM:city_name , MM:vehicle_license)"
    ))
  .load()
The columns in sparksql_table_schema and hbase_table_schema must be listed in the same order, because the two schemas are paired by position.
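To make the positional pairing concrete, here is a small standalone sketch that parses the two option strings the same way the extract helpers do and lists which Spark SQL column ends up bound to which HBase column. `SchemaOrderCheck` and its `mismatches` helper are hypothetical; the helper assumes that each HBase qualifier is named after its Spark SQL column, which holds for the example in this post but not in general (a row key declared as `:key`, for instance, would be flagged).

```scala
object SchemaOrderCheck {
  // Strip the surrounding parentheses and split on commas.
  private def items(schema: String): Array[String] =
    schema.trim.drop(1).dropRight(1).split(",").map(_.trim)

  /** Pairs each Spark SQL column name with the HBase column at the same position. */
  def pairByPosition(sparkSchema: String, hbaseSchema: String): List[(String, String)] = {
    val sparkCols = items(sparkSchema).map(_.split("\\s+")(0))
    val hbaseCols = items(hbaseSchema)
    require(sparkCols.length == hbaseCols.length, "column counts differ")
    sparkCols.zip(hbaseCols).toList
  }

  /** Pairs whose HBase qualifier differs from the Spark SQL column name (naming assumption). */
  def mismatches(sparkSchema: String, hbaseSchema: String): List[(String, String)] =
    pairByPosition(sparkSchema, hbaseSchema).filter { case (sparkCol, hbaseCol) =>
      hbaseCol.split(":", -1).last != sparkCol
    }

  def main(args: Array[String]): Unit = {
    val sparkSchema = "(id String, city_name String)"
    // Swapped HBase columns: both pairs are reported as mismatched.
    mismatches(sparkSchema, "(MM:city_name , MM:id)").foreach(println)
  }
}
```

Running a check like this before calling load() is one way to catch a swapped column early instead of discovering it in the query output.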