4.1 Introduction

Consider a company such as Amazon, which has over 200 million users and may process hundreds of millions of transactions per day. To illustrate the concept of Left Outer Join, assume we have two kinds of data, users and transactions: the users data carries each user's "location" information (say location_id), and the transactions data carries "user" information (say user_id), but transactions have no direct information about user locations. Given users and transactions:

users(user_id, location_id)

transactions(transaction_id, product_id, user_id, quantity, amount)

Consider the following values for our users and transactions (note that these values are merely examples to demonstrate the concept of Left Outer Join in a MapReduce environment):

Here are some SQL queries to answer our questions:

  • Q1: find all products (and associated locations) sold
mysql> SELECT product_id, location_id
-> FROM transactions LEFT OUTER JOIN users
-> ON transactions.user_id = users.user_id;

  • Q2: find all products (and associated location counts) sold
mysql> SELECT product_id, count(location_id)
-> FROM transactions LEFT OUTER JOIN users
-> ON transactions.user_id = users.user_id
-> GROUP BY product_id;

  • Q3: find all products (and unique location counts) sold
mysql> SELECT product_id, count(distinct location_id)
-> FROM transactions LEFT OUTER JOIN users
-> ON transactions.user_id = users.user_id
-> GROUP BY product_id;
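To make the three queries concrete, here is a plain-Python sketch of what Q1 and Q3 compute. The table values (u1, p1, UT, ...) are made-up examples, not the book's sample data, and the dictionary-based "join" is just an illustration of the SQL semantics:

```python
# Simulate LEFT OUTER JOIN + COUNT(DISTINCT location_id) in plain Python.
# All values below are hypothetical examples.
from collections import defaultdict

users = {"u1": "UT", "u2": "GA", "u3": "CA"}          # user_id -> location_id
transactions = [                                       # (transaction_id, product_id, user_id)
    ("t1", "p3", "u1"),
    ("t2", "p1", "u2"),
    ("t3", "p1", "u1"),
    ("t4", "p2", "u2"),
    ("t5", "p4", "u5"),                                # u5 has no user record
]

# Q1: a LEFT OUTER JOIN keeps every transaction, filling None (SQL NULL)
# when no matching user exists.
q1 = [(p, users.get(u)) for (_, p, u) in transactions]

# Q3: count distinct non-NULL locations per product.
locs = defaultdict(set)
for product, location in q1:
    s = locs[product]                  # make sure every product appears
    if location is not None:
        s.add(location)
q3 = {p: len(s) for p, s in locs.items()}

print(q1)   # ('p4', None) survives: that's the "outer" part of the join
print(q3)
```

Note how p4 still shows up with a count of 0: an inner join would have dropped it entirely.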

The concept of a left outer join came up back in the database course: given two tables, called the left table and the right table, rows are joined wherever certain attribute values match; in a left outer join every record of the left table is kept, and where the right table has no match the missing columns are filled with NULL. The book's hypothetical application above, together with the corresponding SQL statements, sets the stage; this chapter is mainly about how to implement a left outer join in the MapReduce environment.

4.2 Implementation of Left Outer join in MapReduce

The desired output is the one produced by SQL Query-3, which finds the number of distinct (unique) locations in which each product has been sold, over all given transactions. We present the Left Outer Join solution in two steps:

  • MapReduce Phase-1: find all products (and associated locations) sold. This phase corresponds to SQL Query-1.
  • MapReduce Phase-2: find all products (and associated unique location counts) sold. This phase corresponds to SQL Query-3.

The output we want is what Q3 above produces: the number of distinct locations in which each product was sold. To get there, the work is split into two phases: first find the locations each product was sold in, then count the distinct locations per product.

4.2.1 MapReduce Phase-1

This phase performs the "left outer join" operation as a single MapReduce job, which uses two mappers (one for users and one for transactions); the reducer emits (key, value) pairs with key = product_id and value = location_id. Using multiple mappers is enabled by the MultipleInputs class (note that with a single mapper we would have used Job.setMapperClass() instead):



Phase-1 of the MapReduce solution needs two mappers: one handles the users data and one handles the transactions data. The user mapper reads its input and emits (user_id, location_id) pairs; the transaction mapper reads its input and emits (user_id, product_id) pairs. Finally a reducer receives the output of both and emits (product_id, location_id) pairs. This looks simple, but it actually cannot work as stated: apart from being sorted by key, the key-value pairs arriving at the reducer have no internal structure, so there is no way to tell which values are locations and which are products. The book therefore labels this naive attempt [VERSION-1] and introduces the improved [VERSION-2] below (which makes one wonder why [VERSION-1] was presented at all).

4.2.1.1 Transaction Mapper [VERSION-2]

The transaction map() reads (transaction_id, product_id, user_id, quantity, amount) and emits (key=Pair(user_id, 2), value=Pair("P", product_id)).

4.2.1.2 User Mapper [VERSION-2]

The user map() reads (user_id, location_id) and emits (key=Pair(user_id, 1), value=Pair("L", location_id)).

4.2.1.3 The reducer of Phase-1 [VERSION-2]

The Phase-1 reducer gets both Pair("L", location_id) and Pair("P", product_id) values and emits (key=product_id, value=location_id).

The [VERSION-2] mapper output looks complicated and a bit odd: the user mapper emits (key=Pair(user_id, 1), value=Pair("L", location_id)) and the transaction mapper emits (key=Pair(user_id, 2), value=Pair("P", product_id)). To follow the algorithm, two things must be clear: first, the pairs the reducer receives are sorted by key, i.e. by user_id; second, each user_id maps to exactly one location_id. The "L" and "P" tags identify what kind of ID the value is. The numbers 1 and 2 in the keys drive a secondary sort: on top of the sort by user_id, they guarantee that the first value following each user_id is its location_id. Consequently every value between that location and the next user_id is a product sold at that location, and the reducer code quoted above becomes understandable (a rare moment of clarity). Note, though, that this is a special case rather than a general technique: it relies on each user having exactly one location, and would break if locations were not unique, unless I have misunderstood. In the book's hand-drawn data-flow diagram the location_id does not even appear first after each user_id, but my reading still seems reasonable.
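The secondary-sort trick can be shown in a plain-Python sketch. The records are hypothetical, the composite-key sorting stands in for Hadoop's shuffle, and the generator stands in for the reducer (the book's actual code is Java):

```python
# Sketch of the VERSION-2 secondary sort: composite keys (user_id, 1|2)
# guarantee each user's single "L" value arrives before all its "P" values.
mapper_output = [
    (("u1", 2), ("P", "p3")),   # from the transaction mapper
    (("u1", 1), ("L", "UT")),   # from the user mapper
    (("u2", 1), ("L", "GA")),
    (("u1", 2), ("P", "p1")),
    (("u2", 2), ("P", "p2")),
]

# The shuffle sorts by the composite key: (user_id, 1) < (user_id, 2).
mapper_output.sort(key=lambda kv: kv[0])

def reduce_phase1(sorted_pairs):
    """Emit (product_id, location_id); 'L' always precedes 'P' per user."""
    current, location = None, None
    for (user, _order), (tag, value) in sorted_pairs:
        if user != current:
            current, location = user, None   # reset per user; a user with no
                                             # "L" record joins against None
        if tag == "L":
            location = value                 # remember this user's location
        else:
            yield (value, location)          # a product sold at that location

result = list(reduce_phase1(mapper_output))
print(result)
```

Resetting `location` at each new user_id is what preserves the left-outer semantics: a transaction whose user has no location record is emitted with None rather than inheriting the previous user's location.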

4.2.2 MapReduce Phase-2: Counting Unique Locations

This phase uses the output of Phase-1 (a sequence of (product_id, location_id) pairs) and generates (product_id, number_of_unique_locations) pairs. The mapper for this phase is an identity mapper, and the reducer counts the number of unique locations per product (by using a Set data structure).
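Phase-2 is small enough to sketch in a few lines of plain Python (the sample pairs are hypothetical; the grouping dictionary stands in for the shuffle that delivers each product's values to one reducer call):

```python
# Sketch of Phase-2: identity mapper, then a reducer that counts unique
# locations per product with a set. Sample pairs are hypothetical.
from collections import defaultdict

phase1_output = [("p1", "UT"), ("p1", "GA"), ("p1", "UT"), ("p2", "GA")]

grouped = defaultdict(list)          # what the shuffle delivers to the reducer
for product, location in phase1_output:
    grouped[product].append(location)

# Reducer: a Set drops duplicate locations before counting.
unique_counts = {p: len(set(locs)) for p, locs in grouped.items()}
print(unique_counts)   # p1 was sold in 2 distinct locations, p2 in 1
```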

4.2.2.1 Mapper Phase-2: Counting Unique Locations

4.2.2.2 Reducer Phase-2: Counting Unique Locations

4.2.3 Implementation Classes in Hadoop

Section "4.3 Sample Run" is skipped.

4.4 Spark Implementation

Since Spark provides a higher-level Java API than the MapReduce/Hadoop API, we can present the whole solution in a single Java class (called LeftOuterJoin), which includes a series of map(), groupBy(), and reduce() functions.



This is how the algorithm works: from the users and transactions data we generate (here T2 refers to Tuple2):

users => (userID, T2("L", location))

transactions => (userID, T2("P", product))

Next, we create a union of these data:

all = transactions.union(users)
    = { (userID1, T2("L", L1)),
        (userID1, T2("P", P11)),
        (userID1, T2("P", P12)),
        ...
        (userID1, T2("P", P1n)),
        ...
      }

where Li is a locationID and Pij is a productID.

The next step is to group the data by userID. This will generate:

{
  (userID1, List<T2("L", L1), T2("P", P11), T2("P", P12), ...>),
  (userID2, List<T2("L", L2), T2("P", P21), T2("P", P22), ...>),
  ...
}

where Li is a locationID and Pij is a productID.
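The union-then-group structure above can be sketched in plain Python (hypothetical values; list concatenation stands in for JavaRDD.union() and a stable sort plus groupby stands in for groupByKey()):

```python
# Sketch of union() + groupByKey() on tagged pairs; sample data is made up.
from itertools import groupby

users = [("u1", ("L", "UT")), ("u2", ("L", "GA"))]
transactions = [("u1", ("P", "p1")), ("u1", ("P", "p2")), ("u2", ("P", "p3"))]

all_pairs = transactions + users          # like JavaRDD.union(): same type
all_pairs.sort(key=lambda kv: kv[0])      # bring equal userIDs together

grouped = {user: [v for _, v in group]
           for user, group in groupby(all_pairs, key=lambda kv: kv[0])}
print(grouped)
```

One caveat: unlike this deterministic sketch, Spark's groupByKey() gives no guarantee about the order of values within a group, so the "L" entry is not necessarily first; the later flatMap step must find it by its tag.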

Spark's higher-level Java API lets us implement the earlier algorithm in a single class. Spark offers a richer API for mappers and reducers, so we can have several mappers of different types without special plug-in classes, and we use the JavaRDD.union() function to return the union of the two JavaRDDs (the user RDD and the transaction RDD); the two RDDs being unioned must have the same type.

4.4.1 Spark Program

As usual, the code is pasted in full below.

4.4.2 STEP-0: Import Required Classes

4.4.3 STEP-1: Read Input Parameters

4.4.4 STEP-2: Create JavaSparkContext Object

A JavaSparkContext object is created using the Spark master URL. This object is used to create the first RDD.

4.4.5 STEP-3: Create a JavaPairRDD for Users

First, we create a users JavaRDD<String>, where each RDD element is a single record of the text file (representing a userID and locationID). Next, we use the JavaRDD<String>.mapToPair() function to create a new JavaPairRDD<String,Tuple2<String,String>> where the key is a userID and the value is a Tuple2("L", location).

4.4.6 STEP-4: Create a JavaPairRDD for Transactions

First, we create a transactions JavaRDD<String>, where each RDD element is a single record of the text file (representing a transaction record). Next, we use the JavaRDD<String>.mapToPair() function to create a new JavaPairRDD<String,Tuple2<String,String>> where the key is a userID and the value is a Tuple2("P", product).

4.4.7 STEP-5: Create a Union of the RDDs Created in STEP-3 and STEP-4

This step creates the union of the two JavaPairRDD<String,Tuple2<String,String>>s. JavaPairRDD.union() requires both RDDs to have exactly the same type.



4.4.8 STEP-6: Create a JavaPairRDD(userID, List(T2)) by Calling groupBy()

Next, we group the data (created in STEP-5) by userID. This step is accomplished by JavaPairRDD.groupByKey().

Here the order of the key-value pairs matches my reading: within each user, the location entry comes first.

4.4.9 STEP-7: Create a productLocationsRDD as JavaPairRDD(String, String)

This step is accomplished by the JavaPairRDD.flatMapToPair() function, for which we implement a PairFlatMapFunction.call() method. PairFlatMapFunction works as:

PairFlatMapFunction<T, K, V>
    T => Iterable<Tuple2<K, V>>

where in our example T is the input and we create (K, V) pairs as output:

    T = Tuple2<String, Iterable<Tuple2<String, String>>>
    K = String
    V = String
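The flatMapToPair() step can be sketched in plain Python (hypothetical data; the function below plays the role of PairFlatMapFunction.call(), and it locates the "L" entry by its tag rather than by position, since groupByKey() does not guarantee ordering):

```python
# Sketch of STEP-7: flatten each (userID, tagged-values) entry into
# (product, location) pairs, mirroring flatMapToPair(). Data is made up.
def flat_map_to_pair(entry):
    _user, tagged = entry
    location = None
    products = []
    for tag, value in tagged:
        if tag == "L":
            location = value          # the user's (single) location
        else:
            products.append(value)    # a product this user bought
    return [(p, location) for p in products]

grouped = [("u1", [("P", "p1"), ("L", "UT"), ("P", "p2")]),
           ("u2", [("L", "GA"), ("P", "p1")])]
pairs = [pair for entry in grouped for pair in flat_map_to_pair(entry)]
print(pairs)   # -> [('p1', 'UT'), ('p2', 'UT'), ('p1', 'GA')]
```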

4.4.10 STEP-8: Find all locations for a product

This step groups the (product, location) RDD pairs by product. We use JavaPairRDD.groupByKey() to accomplish this step, and also do some basic debugging by calling the JavaPairRDD.collect() function.

4.4.11 STEP-9: Finalize output by changing "value"

STEP-8 produced a JavaPairRDD<String, List<String>> object, where the key is a product (as a String) and the value is a List<String>: a list of locations, which might contain duplicates. To remove duplicate elements from a value, we use the JavaPairRDD.mapValues() function. We implement this function by converting a List<String> into a Set<String>. Note that the keys are not altered. Mapping values is implemented by Function(T, R).call(), where T is the input (a List<String>) and R is the output (a Tuple2<Set<String>, Integer>).
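The mapValues() step reduces to one line in a plain-Python sketch (hypothetical values; a dict comprehension stands in for mapValues(), and the (set, size) tuple mirrors the Tuple2<Set<String>, Integer> output):

```python
# Sketch of STEP-9: turn each list of locations into (set, size),
# leaving the product keys untouched. Sample values are made up.
step8 = {"p1": ["UT", "GA", "UT"], "p2": ["GA"]}

step9 = {product: (set(locs), len(set(locs)))
         for product, locs in step8.items()}
print(step9)   # p1 now maps to a 2-element set; the duplicate "UT" is gone
```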

4.4.12 STEP-10: Print the final result RDD

The final step emits the results by using JavaPairRDD.collect() method.

Sections "4.4.13 Running Spark Solution" and "4.5 Running Spark on YARN" are skipped.

4.6 Left Outer Join by Spark's leftOuterJoin()

This section solves the left outer join by using Spark's built-in JavaPairRDD.leftOuterJoin() method. Using Spark's JavaPairRDD.leftOuterJoin() method enables us:

  1. To avoid the costly JavaPairRDD.union() operation between users and transactions.
  2. To avoid introducing custom flags such as "L" for locations and "P" for products.
  3. To avoid extra RDD transformations for separating the custom flags from each other.

Using the JavaPairRDD.leftOuterJoin() method enables us to produce the result efficiently. The transactionsRDD is the left table and the usersRDD is the right table.
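The semantics of leftOuterJoin() can be sketched in plain Python (sample values are hypothetical; `left_outer_join` is my own helper, and None models Spark's Optional.absent() for an unmatched right side):

```python
# Sketch of JavaPairRDD.leftOuterJoin(): every left-side key survives;
# the right value becomes None when no match exists. Data is made up.
def left_outer_join(left, right):
    right_map = {}
    for k, v in right:
        right_map.setdefault(k, []).append(v)
    return [(k, (lv, rv))
            for k, lv in left
            for rv in right_map.get(k, [None])]   # [None] keeps unmatched keys

transactions = [("u1", "p1"), ("u2", "p2"), ("u5", "p4")]   # left table
users = [("u1", "UT"), ("u2", "GA")]                        # right table

joined = left_outer_join(transactions, users)
print(joined)   # ('u5', ('p4', None)) is kept: that's the "left outer" part
```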

Using Spark's built-in JavaPairRDD.leftOuterJoin() method to solve the problem is more convenient and efficient. Note that the joined result still contains the user_id, which must be dropped afterwards. As usual, the code is pasted in full below.

4.6.1 High-Level Steps

4.6.2 STEP-0: import required classes and interfaces

4.6.3 STEP-1: read input parameters

4.6.4 STEP-2: create Spark's context object

4.6.5 STEP-3: create RDD for user's data

This step creates usersRDD, which is a set of (userID, location) pairs. The usersRDD represents the "right" table for the left outer join operation.

4.6.6 STEP-4: Create usersRDD: The "right" Table

This step creates the right table, represented as usersRDD, which contains (K=userID, V=location) pairs from the users input data.

4.6.7 STEP-5: create transactionsRDD for transaction data

This step creates transactionsRDD, which is a set of (userID, product) pairs. The transactionsRDD represents the "left" table for the left outer join operation.

4.6.8 STEP-6: Create transactionsRDD: The Left Table

This step creates the left table, represented as transactionsRDD, which contains (K=userID, V=product) pairs from the transactions input data.

4.6.9 STEP-7: use Spark's built-in JavaPairRDD.leftOuterJoin() method

This is the core step: performing the left outer join operation by using Spark's JavaPairRDD.leftOuterJoin() method.

4.6.10 STEP-8: create (product, location) pairs

This step builds another JavaPairRDD, which contains (K=product, V=location) pairs. Note that we completely ignore the userIDs, since we are only interested in products and their unique user locations.

4.6.11 STEP-9: group (K=product, V=location) pairs by K

This step groups the (K=product, V=location) pairs by K. The result is (K, V2), where V2 is a list of locations (which may contain duplicates).

4.6.12 STEP-10: create final output (K=product, V=Set(location))

This final step removes duplicate locations and creates (K, V2), where V2 is a Tuple2<Set<location>, size>.
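STEPs 8 through 10 chain together naturally; here is a plain-Python sketch over a hypothetical join result (list/dict operations stand in for the RDD transformations):

```python
# Sketch chaining STEP-8..STEP-10: drop userIDs, group by product,
# then dedupe locations and attach the size. Sample data is made up.
from collections import defaultdict

joined = [("u1", ("p1", "UT")), ("u2", ("p1", "GA")),
          ("u3", ("p1", "UT")), ("u2", ("p2", "GA"))]

# STEP-8: (product, location) pairs -- the userIDs are no longer needed.
pairs = [(product, location) for _, (product, location) in joined]

# STEP-9: group by product; duplicate locations are still present here.
by_product = defaultdict(list)
for product, location in pairs:
    by_product[product].append(location)

# STEP-10: replace each list with (set_of_locations, size).
final = {p: (set(locs), len(set(locs))) for p, locs in by_product.items()}
print(final)
```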

Section "4.6.13 Sample Run by YARN" is skipped.

To sum up briefly: as usual the pasted implementation code makes this post look long, but there is not much content. The left outer join discussed in this chapter was touched on in the database course, so understanding it is no problem; the difference is that the data here is big data, for which SQL is not suitable, and the chapter is about how to solve the left outer join with Hadoop and Spark.

The book presents three solutions to the example, though the first two do not feel like general-purpose methods. The example has two tables: one records user IDs and location IDs, the other records user IDs, product IDs, and some other business fields. The problem is to find the locations in which each product was sold, i.e. to emit (product_id, location_id) pairs. The natural approach for big data is to split the input into chunks handled by many mappers, extracting the information we need from each kind of data: (user_id, location_id) from the user table and (user_id, product_id) from the transaction table. Both outputs go to a reducer, which joins them on user_id and emits the result. However, if the two outputs are not processed further, the pairs arriving at the reducer have no internal structure: we cannot tell whether the value following a user_id is location information or product information. So in the first two methods the mapper output is not that simple; extra fields are added for secondary sorting and for distinguishing locations from products, so that the pairs reaching the reducer are not only sorted by user_id but that the first value after each user_id is a location. Then every value before the next user_id is a product sold at that location, and emitting the result becomes easy. I call these methods non-general because here each user_id corresponds to exactly one location; what if there were more than one?

The last method uses Spark's leftOuterJoin(): we just drop the user_id after the join and tidy the result by product_id before emitting. Since the left outer join is done directly by a library method rather than by purpose-built code like the first two, this feels like the general method. As for the concrete implementation, I again just pasted and read the code, even though I could not have written it myself. The sample runs of course come out correct, so there seemed little point in pasting them. That is all.
