Sometimes the aggregate functions provided by Spark are not adequate, so Spark also accepts custom user-defined aggregate functions (UDAFs). Before diving into the code, let's first look at the methods of the class UserDefinedAggregateFunction that a UDAF must override.

1. inputSchema()

In this method you need to define a StructType that represents the input arguments of this aggregate function.

2. bufferSchema()

In this method you need to define a StructType that represents the values in the aggregation buffer. This schema describes the intermediate values the aggregation holds while rows are being processed.

3. dataType()

The DataType of the value returned by this aggregate function.

4. initialize(MutableAggregationBuffer buffer)

This method is invoked whenever aggregation starts for a new "key" (group), to set up a fresh aggregation buffer. You can use this method to reinitialise your buffer fields to their starting values.

5. evaluate(Row buffer)

This method calculates the final value by referring to the aggregation buffer.

6. update(MutableAggregationBuffer buffer, Row input)

This method updates the aggregation buffer; it is invoked once for every new input row that belongs to the same key.

7. merge(MutableAggregationBuffer buffer, Row input)

This method merges the outputs of two different aggregation buffers, e.g., partial aggregates computed on different partitions. The skeleton below shows where each of these methods sits in a concrete class.
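Before the full implementation, here is a minimal skeleton of a UDAF class; the class name MyUDAF is mine and the bodies are stubs, with the real versions appearing in the complete code further down. Note that deterministic(), which reports whether the function always returns the same output for the same input, must also be overridden:

import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.StructType;

public class MyUDAF extends UserDefinedAggregateFunction {
    @Override public StructType inputSchema()  { return null; } // schema of the input arguments
    @Override public StructType bufferSchema() { return null; } // schema of the aggregation buffer
    @Override public DataType dataType()       { return null; } // type of the final result
    @Override public boolean deterministic()   { return true; } // same input => same output
    @Override public void initialize(MutableAggregationBuffer buffer) { }         // reset buffer for a new key
    @Override public void update(MutableAggregationBuffer buffer, Row input) { }  // fold one input row in
    @Override public void merge(MutableAggregationBuffer buffer, Row other) { }   // combine two partial buffers
    @Override public Object evaluate(Row buffer) { return null; }                 // produce the final value
}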

Pictorially, assuming there are 2 aggregation buffers for your task, the methods interact as sketched below.
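A rough text sketch of that flow, assuming the input rows are split across two partitions so that two partial buffers exist:

    rows (partition 1)               rows (partition 2)
           |                                |
    initialize(buffer1)             initialize(buffer2)
           |                                |
    update(buffer1, row) per row    update(buffer2, row) per row
           \                                /
            +--------- merge(b1, b2) ------+
                           |
                   evaluate(buffer)
                           |
                 final value for the key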

Let's see how we can write a UDAF that accepts multiple values as input and returns multiple values as output.

My input file is a .txt file containing 3 columns: city, female count, and male count. We need to compute the total population of each city and which population (female or male) is dominant there.

CITIES.TXT

Nashik 40 50
Mumbai 50 60
Pune 70 80
Nashik 40 50
Mumbai 50 60
Pune 170 80

The expected output is as below:

+--------+--------+--------+
| city   |Dominant| Total  |
+--------+--------+--------+
| Mumbai | Male   | 220    |
| Pune   | Female | 400    |
| Nashik | Male   | 180    |
+--------+--------+--------+
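As a quick check of the arithmetic: Pune appears twice in the file, contributing 70 + 80 + 170 + 80 = 400 people in total, and its female total (70 + 170 = 240) exceeds its male total (80 + 80 = 160), so Female is dominant there.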

Now let's write a UDAF class that extends the UserDefinedAggregateFunction class. I have provided the required comments in the code below.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SparkUDAF extends UserDefinedAggregateFunction
{
    private StructType inputSchema;
    private StructType bufferSchema;
    private DataType returnDataType =
        DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType);

    public SparkUDAF()
    {
        // inputSchema: this UDAF accepts 2 inputs, both of type Integer
        List<StructField> inputFields = new ArrayList<StructField>();
        inputFields.add(DataTypes.createStructField("femaleCount", DataTypes.IntegerType, true));
        inputFields.add(DataTypes.createStructField("maleCount", DataTypes.IntegerType, true));
        inputSchema = DataTypes.createStructType(inputFields);

        // bufferSchema: the running values this UDAF holds while aggregating
        List<StructField> bufferFields = new ArrayList<StructField>();
        bufferFields.add(DataTypes.createStructField("totalCount", DataTypes.IntegerType, true));  // index 0
        bufferFields.add(DataTypes.createStructField("femaleCount", DataTypes.IntegerType, true)); // index 1
        bufferFields.add(DataTypes.createStructField("maleCount", DataTypes.IntegerType, true));   // index 2
        bufferSchema = DataTypes.createStructType(bufferFields);
    }

    /**
     * This method determines which bufferSchema will be used
     */
    @Override
    public StructType bufferSchema() {
        return bufferSchema;
    }

    /**
     * This method determines the return type of this UDAF
     */
    @Override
    public DataType dataType() {
        return returnDataType;
    }

    /**
     * Returns true iff this function is deterministic, i.e. given the same input,
     * it always returns the same output.
     */
    @Override
    public boolean deterministic() {
        return true;
    }

    /**
     * This method re-initializes the counters to 0 when aggregation starts for a new city
     */
    @Override
    public void initialize(MutableAggregationBuffer buffer) {
        buffer.update(0, 0); // totalCount
        buffer.update(1, 0); // femaleCount
        buffer.update(2, 0); // maleCount
    }

    /**
     * This method folds one input row into the buffer, adding the row's female
     * and male counts to the running totals for the current city
     */
    @Override
    public void update(MutableAggregationBuffer buffer, Row input) {
        buffer.update(0, buffer.getInt(0) + input.getInt(0) + input.getInt(1));
        buffer.update(1, buffer.getInt(1) + input.getInt(0));
        buffer.update(2, buffer.getInt(2) + input.getInt(1));
    }

    /**
     * This method merges the data of two buffers (partial aggregates
     * computed on different partitions)
     */
    @Override
    public void merge(MutableAggregationBuffer buffer, Row input) {
        buffer.update(0, buffer.getInt(0) + input.getInt(0));
        buffer.update(1, buffer.getInt(1) + input.getInt(1));
        buffer.update(2, buffer.getInt(2) + input.getInt(2));
    }

    /**
     * This method calculates the final value by referring to the aggregation buffer
     */
    @Override
    public Object evaluate(Row buffer) {
        // Prepare the final map that is returned as this UDAF's output
        Map<String, String> op = new HashMap<String, String>();
        op.put("Total", "" + buffer.getInt(0));
        op.put("dominant", buffer.getInt(1) > buffer.getInt(2) ? "Female" : "Male");
        return op;
    }

    /**
     * This method determines the input schema of this UDAF
     */
    @Override
    public StructType inputSchema() {
        return inputSchema;
    }
}

Now let's see how we can access this UDAF from our Spark code.
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class TestDemo {
    public static void main(String[] args)
    {
        // Set up the SparkContext and SQLContext
        SparkConf conf = new SparkConf().setAppName("udaf").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Create a Row RDD: one Row(city, femaleCount, maleCount) per input line
        JavaRDD<String> citiesRdd = sc.textFile("cities.txt");
        JavaRDD<Row> rowRdd = citiesRdd.map(new Function<String, Row>() {
            public Row call(String line) throws Exception {
                StringTokenizer st = new StringTokenizer(line, " ");
                return RowFactory.create(st.nextToken().trim(),
                        Integer.parseInt(st.nextToken().trim()),
                        Integer.parseInt(st.nextToken().trim()));
            }
        });

        // Create the StructType describing the three columns
        List<StructField> inputFields = new ArrayList<StructField>();
        inputFields.add(DataTypes.createStructField("city", DataTypes.StringType, true));
        inputFields.add(DataTypes.createStructField("Female", DataTypes.IntegerType, true));
        inputFields.add(DataTypes.createStructField("Male", DataTypes.IntegerType, true));
        StructType inputSchema = DataTypes.createStructType(inputFields);

        // Create the DataFrame
        DataFrame df = sqlContext.createDataFrame(rowRdd, inputSchema);

        // Register our Spark UDAF
        SparkUDAF sparkUDAF = new SparkUDAF();
        sqlContext.udf().register("uf", sparkUDAF);

        // Register the DataFrame as a table
        df.registerTempTable("cities");

        // Run the query: aggregate per city, then pull the two map entries out
        sqlContext.sql("SELECT city, count['dominant'] AS Dominant, count['Total'] AS Total "
                + "FROM (SELECT city, uf(Female, Male) AS count FROM cities GROUP BY city) temp")
                .show(false);
    }
}
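As an aside, once the SparkUDAF instance exists you can skip the SQL layer: UserDefinedAggregateFunction exposes an apply(Column...) method that plugs the UDAF straight into the DataFrame API. A minimal sketch, continuing from the df and sparkUDAF variables above (the column alias "count" is my choice):

import static org.apache.spark.sql.functions.col;

// Aggregate per city with the UDAF, then unpack the returned map
DataFrame result = df.groupBy("city")
        .agg(sparkUDAF.apply(col("Female"), col("Male")).as("count"));
result.selectExpr("city", "count['dominant'] AS Dominant", "count['Total'] AS Total")
        .show(false);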

Source: https://blog.augmentiq.in/2016/08/05/spark-multiple-inputoutput-user-defined-aggregate-function-udaf-using-java/
