SparkSQL UDFs: Usage and Implementation Explained

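UDFs are a common feature in SQL, but in Spark 1.6 and earlier you could only create temporary UDFs; persistent UDFs were not supported unless you modified the Spark source code. Starting with Spark 2.0, SparkSQL finally supports persistent UDFs. This article explains how to use UDFs in SparkSQL and how they are implemented under the hood.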

1. Temporary UDFs

Creating and using one:

create temporary function tmp_trans_array as 'com.test.spark.udf.TransArray' using jar 'spark-test-udf-1.0.0.jar';

select tmp_trans_array (, '\\|' , id, position) as (id0, position0) from test_udf limit ;

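Implementation: in the run method of org.apache.spark.sql.execution.command.CreateFunctionCommand, Spark checks whether the function being created is temporary and, if so, creates a temporary function. As the code below shows, a temporary function is registered directly in functionRegistry (whose implementation class is SimpleFunctionRegistry), that is, in memory.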

def createTempFunction(
    name: String,
    info: ExpressionInfo,
    funcDefinition: FunctionBuilder,
    ignoreIfExists: Boolean): Unit = {
  if (functionRegistry.lookupFunctionBuilder(name).isDefined && !ignoreIfExists) {
    throw new TempFunctionAlreadyExistsException(name)
  }
  functionRegistry.registerFunction(name, info, funcDefinition)
}

Below is the actual registration code. Every UDF that is needed gets loaded into a StringKeyHashMap.

protected val functionBuilders =
  StringKeyHashMap[(ExpressionInfo, FunctionBuilder)](caseSensitive = false)

override def registerFunction(
    name: String,
    info: ExpressionInfo,
    builder: FunctionBuilder): Unit = synchronized {
  functionBuilders.put(name, (info, builder))
}
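To make the in-memory registration concrete, here is a minimal, self-contained sketch of a case-insensitive function registry. It is not Spark's code: `RegistrySketch`, `SimpleRegistry`, the simplified `ExpressionInfo`, and the `Seq[String] => String` builder type are all hypothetical stand-ins, and a plain `IllegalStateException` replaces `TempFunctionAlreadyExistsException`. It only mirrors the two behaviours shown above: names are matched without regard to case, and an existing name is rejected unless `ignoreIfExists` is set.

```scala
import java.util.Locale
import scala.collection.mutable

object RegistrySketch {
  // Hypothetical, simplified stand-ins for Spark's ExpressionInfo and FunctionBuilder.
  final case class ExpressionInfo(className: String, name: String)
  type FunctionBuilder = Seq[String] => String

  final class SimpleRegistry {
    // Keys are lower-cased on the way in and out, imitating
    // StringKeyHashMap(caseSensitive = false).
    private val builders =
      mutable.HashMap.empty[String, (ExpressionInfo, FunctionBuilder)]

    private def normalize(name: String): String = name.toLowerCase(Locale.ROOT)

    def registerFunction(
        name: String,
        info: ExpressionInfo,
        builder: FunctionBuilder): Unit = synchronized {
      builders.update(normalize(name), (info, builder))
    }

    def lookupFunctionBuilder(name: String): Option[FunctionBuilder] = synchronized {
      builders.get(normalize(name)).map(_._2)
    }

    // Same check as createTempFunction above: fail only when the name is
    // already taken and the caller did not ask to ignore that.
    def createTempFunction(
        name: String,
        info: ExpressionInfo,
        builder: FunctionBuilder,
        ignoreIfExists: Boolean): Unit = {
      if (lookupFunctionBuilder(name).isDefined && !ignoreIfExists) {
        throw new IllegalStateException(s"Temporary function '$name' already exists")
      }
      registerFunction(name, info, builder)
    }
  }
}
```

Because the map lives only in the driver's memory, everything registered this way disappears when the session ends, which is why such a function is temporary.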

2. Persistent UDFs

Usage is shown below. Note that it is best to put the jar on HDFS so that the function can also be used from other machines.

create function trans_array as 'com.test.spark.udf.TransArray' using jar 'hdfs://namenodeIP:9000/libs/spark-test-udf-1.0.0.jar';

select trans_array (, ' \\|' , id, position) as (id0, position0) from test_spark limit ;

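Implementation

(1) Creating a permanent function: org.apache.spark.sql.execution.command.CreateFunctionCommand calls createFunction on SessionCatalog, which ultimately executes createFunction on HiveExternalCatalog. This shows that creating a permanent function creates the corresponding function in the Hive metastore database. Querying the metastore database returns the following records, which confirms that the function has been created there.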

mysql> select * from FUNCS;
| FUNC_ID | CLASS_NAME                    | CREATE_TIME | DB_ID | FUNC_NAME   | FUNC_TYPE | OWNER_NAME | OWNER_TYPE |
|         | com.test.spark.udf.TransArray |             |       | trans_array |           | NULL       | USER       |

mysql> select * from FUNC_RU;
| FUNC_ID | RESOURCE_TYPE | RESOURCE_URI                                         | INTEGER_IDX |
|         |               | hdfs://namenodeIP:9000/libs/spark-test-udf-1.0.0.jar | 0           |

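(2) Using a permanent function: when a UDF in a SQL statement is resolved, the lookupFunction method of SessionCatalog is called. This method first checks whether the function already exists in memory. If it does not, the UDF is loaded: the RESOURCE_URI is added to the ClassLoader's path, and the UDF is then registered in the in-memory functionRegistry. The main code is in SessionCatalog, as follows: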

  def lookupFunction(
      name: FunctionIdentifier,
      children: Seq[Expression]): Expression = synchronized {
    // Note: the implementation of this function is a little bit convoluted.
    // We probably shouldn't use a single FunctionRegistry to register all three kinds of functions
    // (built-in, temp, and external).
    if (name.database.isEmpty && functionRegistry.functionExists(name)) {
      // This function has been already loaded into the function registry.
      return functionRegistry.lookupFunction(name, children)
    }

    // If the name itself is not qualified, add the current database to it.
    val database = formatDatabaseName(name.database.getOrElse(getCurrentDatabase))
    val qualifiedName = name.copy(database = Some(database))

    if (functionRegistry.functionExists(qualifiedName)) {
      // This function has been already loaded into the function registry.
      // Unlike the above block, we find this function by using the qualified name.
      return functionRegistry.lookupFunction(qualifiedName, children)
    }

    // The function has not been loaded to the function registry, which means
    // that the function is a permanent function (if it actually has been registered
    // in the metastore). We need to first put the function in the FunctionRegistry.
    // TODO: why not just check whether the function exists first?
    val catalogFunction = try {
      externalCatalog.getFunction(database, name.funcName)
    } catch {
      case _: AnalysisException => failFunctionLookup(name)
      case _: NoSuchPermanentFunctionException => failFunctionLookup(name)
    }
    loadFunctionResources(catalogFunction.resources)
    // Please note that qualifiedName is provided by the user. However,
    // catalogFunction.identifier.unquotedString is returned by the underlying
    // catalog. So, it is possible that qualifiedName is not exactly the same as
    // catalogFunction.identifier.unquotedString (difference is on case-sensitivity).
    // At here, we preserve the input from the user.
    registerFunction(catalogFunction.copy(identifier = qualifiedName), overrideIfExists = false)
    // Now, we need to create the Expression.
    functionRegistry.lookupFunction(qualifiedName, children)
  }

  /**
   * List all functions in the specified database, including temporary functions. This
   * returns the function identifier and the scope in which it was defined (system or user
   * defined).
   */
  def listFunctions(db: String): Seq[(FunctionIdentifier, String)] = listFunctions(db, "*")

  /**
   * List all matching functions in the specified database, including temporary functions. This
   * returns the function identifier and the scope in which it was defined (system or user
   * defined).
   */
  def listFunctions(db: String, pattern: String): Seq[(FunctionIdentifier, String)] = {
    val dbName = formatDatabaseName(db)
    requireDbExists(dbName)
    val dbFunctions = externalCatalog.listFunctions(dbName, pattern).map { f =>
      FunctionIdentifier(f, Some(dbName)) }
    val loadedFunctions = StringUtils
      .filterPattern(functionRegistry.listFunction().map(_.unquotedString), pattern).map { f =>
        // In functionRegistry, function names are stored as an unquoted format.
        Try(parser.parseFunctionIdentifier(f)) match {
          case Success(e) => e
          case Failure(_) =>
            // The names of some built-in functions are not parsable by our parser, e.g., %
            FunctionIdentifier(f)
        }
      }
    val functions = dbFunctions ++ loadedFunctions
    // The session catalog caches some persistent functions in the FunctionRegistry
    // so there can be duplicates.
    functions.map {
      case f if FunctionRegistry.functionSet.contains(f) => (f, "SYSTEM")
      case f => (f, "USER")
    }.distinct
  }
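The lookup order in lookupFunction can be summarised with a small, self-contained model. This is only a sketch under simplifying assumptions: `LookupSketch`, `ToySessionCatalog`, `CatalogFunction`, and `registerTemp` are hypothetical names, the metastore is an immutable `Map`, a function resolves to its class name instead of an `Expression`, and "loading" a resource just records its URI rather than touching a class loader.

```scala
import java.util.Locale
import scala.collection.mutable

object LookupSketch {
  // Hypothetical model of one metastore entry (compare the FUNCS and FUNC_RU rows).
  final case class CatalogFunction(
      db: String,
      name: String,
      className: String,
      resources: Seq[String])

  final class ToySessionCatalog(
      metastore: Map[(String, String), CatalogFunction],
      currentDb: String) {

    // In-memory registry: unqualified or "db.name" key -> implementing class name.
    private val registry = mutable.HashMap.empty[String, String]

    // Resource URIs loaded so far; real Spark would add these jars to the class loader.
    val loadedResources: mutable.ArrayBuffer[String] = mutable.ArrayBuffer.empty[String]

    private def norm(s: String): String = s.toLowerCase(Locale.ROOT)

    def registerTemp(name: String, className: String): Unit =
      registry.update(norm(name), className)

    def lookupFunction(name: String, database: Option[String] = None): String = synchronized {
      val key = norm(name)
      val qualified = s"${norm(database.getOrElse(currentDb))}.$key"
      if (database.isEmpty && registry.contains(key)) {
        // Step 1: an unqualified name that is already in memory (e.g. a temp function).
        registry(key)
      } else if (registry.contains(qualified)) {
        // Step 2: a permanent function that an earlier lookup already cached.
        registry(qualified)
      } else {
        // Step 3: fetch from the metastore, load its resources, then cache it.
        val fn = metastore.getOrElse(
          (norm(database.getOrElse(currentDb)), key),
          throw new NoSuchElementException(s"Undefined function: $name"))
        loadedResources ++= fn.resources
        registry.update(qualified, fn.className)
        fn.className
      }
    }
  }
}
```

In this model the first lookup of a permanent function reads the metastore and loads the jar, while later lookups in the same session are served from memory.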
