4.1 Introduction

Consider a company such as Amazon, which has over 200 million users and may process hundreds of millions of transactions per day. To illustrate the concept of Left Outer Join, assume we have two kinds of data, users and transactions: the users data carries each user's "location" (say, location_id) and the transactions data carries the "user" (say, user_id), but transactions have no direct information about users' locations. Given users and transactions:

users(user_id, location_id)

transactions(transaction_id, product_id, user_id, quantity, amount)

Consider the following values for our users and transactions (these values are just examples to demonstrate the concept of Left Outer Join in a MapReduce environment):

Here are some SQL queries to answer our questions:

  • Q1: find all products (and associated locations) sold
mysql> SELECT product_id, location_id
-> FROM transactions LEFT OUTER JOIN users
-> ON transactions.user_id = users.user_id;

  • Q2: find all products (and associated location counts) sold
mysql> SELECT product_id, COUNT(location_id)
-> FROM transactions LEFT OUTER JOIN users
-> ON transactions.user_id = users.user_id
-> GROUP BY product_id;

  • Q3: find all products (and unique location counts) sold
mysql> SELECT product_id, COUNT(DISTINCT location_id)
-> FROM transactions LEFT OUTER JOIN users
-> ON transactions.user_id = users.user_id
-> GROUP BY product_id;

The concept of a left outer join came up back when I studied databases: roughly, there are two tables, called the left table and the right table, which are joined on equality of certain attribute values; in a left outer join, every record of the left table appears in the result, and where the right table has no match, NULL is filled in. The book's hypothetical application is quoted above together with the corresponding SQL statements; this chapter is mainly about how to implement the left outer join in a MapReduce environment.

4.2 Implementation of Left Outer Join in MapReduce

The desired output is the one produced by SQL Query-3, which finds the number of distinct (unique) locations in which each product has been sold, over all given transactions. We present the Left Outer Join solution in two steps:

  • MapReduce Phase-1: find all products (and associated locations) sold. Phase-1 corresponds to SQL Query-1.
  • MapReduce Phase-2: find all products (and associated unique location counts) sold. Phase-2 corresponds to SQL Query-3.

The output we want is exactly what Q3 above produces, i.e., the number of locations in which each product has been sold. To get there, the work is split into two phases: first find the locations where each product was sold, then count them.

4.2.1 MapReduce Phase-1

This phase performs the "left outer join" operation as a MapReduce job that uses two mappers (one for users and one for transactions); the reducer emits (Key, Value) pairs with Key = product_id and Value = location_id. Using multiple mappers is enabled by the MultipleInputs class (if we had a single mapper, we would instead call Job.setMapperClass()), as sketched below:
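A minimal driver sketch, assuming tab-delimited text inputs and plain Text keys; the class names are placeholders, and the book's actual Phase-1 job additionally configures a custom Pair key, a partitioner, and a grouping comparator for secondary sort:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LeftOuterJoinPhase1Driver {
    public static void main(String[] args) throws Exception {
        Path transactionsPath = new Path(args[0]);
        Path usersPath = new Path(args[1]);
        Path outputPath = new Path(args[2]);

        Job job = Job.getInstance();
        job.setJarByClass(LeftOuterJoinPhase1Driver.class);
        job.setJobName("Phase-1: left outer join");

        // one mapper per input path; with a single mapper we would
        // call job.setMapperClass(...) instead
        MultipleInputs.addInputPath(job, transactionsPath,
                TextInputFormat.class, TransactionMapper.class);
        MultipleInputs.addInputPath(job, usersPath,
                TextInputFormat.class, UserMapper.class);

        job.setReducerClass(LeftOuterJoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, outputPath);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}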



Phase-1 of the MapReduce solution needs two mappers, one for the users data and one for the transactions data. The user mapper reads its input and emits (user_id, location_id) pairs; the transaction mapper reads its input and emits (user_id, product_id) pairs. Finally a reducer receives the output of both and emits (product_id, location_id) pairs. That sounds simple, but it cannot actually work this way: apart from being sorted by key, the key-value pairs the reducer receives have no other internal structure, so from such data there is no way to tell where a product was sold. The book calls this [VERSION-1], which leads to the improved [VERSION-2] below (one might ask why bother presenting [VERSION-1] at all).

4.2.1.1 Transaction Mapper [VERSION-2]

The transaction map() reads (transaction_id, product_id, user_id, quantity, amount) and emits (key=Pair(user_id, 2), value=Pair("P", product_id)).
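A simplified sketch of the transaction mapper, assuming tab-delimited input and a plain Text key tagged with "P"; the book's version instead emits the composite Pair(user_id, 2) key that drives the secondary sort:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TransactionMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // input record: transaction_id<TAB>product_id<TAB>user_id<TAB>quantity<TAB>amount
        String[] tokens = value.toString().split("\t");
        String productId = tokens[1];
        String userId = tokens[2];
        // tag the value with "P" so the reducer can tell products from locations
        context.write(new Text(userId), new Text("P," + productId));
    }
}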

4.2.1.2 User Mapper [VERSION-2]

The user map() reads (user_id, location_id) and emits (key=Pair(user_id, 1), value=Pair("L", location_id)).
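A matching sketch of the user mapper, under the same simplifying assumptions; the book's version emits the composite Pair(user_id, 1) key so that a user's location sorts before that user's products:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UserMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // input record: user_id<TAB>location_id
        String[] tokens = value.toString().split("\t");
        // tag the value with "L" to mark it as a location
        context.write(new Text(tokens[0]), new Text("L," + tokens[1]));
    }
}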

4.2.1.3 The reduce of Phase-1 [VERSION-2]

The reducer gets both Pair("L", location_id) and Pair("P", product_id) values and emits (key=product_id, value=location_id).
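A sketch of the Phase-1 reducer. Because the simplified mappers above do not use secondary sort, this version buffers a user's values and picks out the location explicitly; with the book's Pair(user_id, 1|2) keys the location is guaranteed to arrive first and no buffering is needed:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LeftOuterJoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text userId, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // separate the single "L" value (location) from the "P" values (products)
        String location = "undefined";   // left outer join: keep products whose user has no location
        List<String> products = new ArrayList<>();
        for (Text value : values) {
            String[] tagged = value.toString().split(",", 2);
            if ("L".equals(tagged[0])) {
                location = tagged[1];
            } else {
                products.add(tagged[1]);
            }
        }
        // emit (product_id, location_id) for every product this user bought
        for (String product : products) {
            context.write(new Text(product), new Text(location));
        }
    }
}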

In [VERSION-2] the key-value pairs emitted by the two mappers are more elaborate, and look a little odd: the user mapper emits (key=Pair(user_id, 1), value=Pair("L", location_id)) and the transaction mapper emits (key=Pair(user_id, 2), value=Pair("P", product_id)). To follow the idea, two things need to be clear: first, the key-value pairs the reducer receives are sorted by key, i.e., by user_id; second, each user_id has exactly one location_id. The "L" and "P" tags identify which kind of id the value carries. The numbers 1 and 2 in the emitted keys are used for secondary sorting: on top of the sort by user_id, the intent is that the first value following each user_id will be its location_id. Every value up to the next user_id's location_id is then the id of a product sold at that location. With that, the reducer code quoted above becomes understandable (a rare moment of clarity). Still, this is a special case rather than a general method — what if the location were not unique? Or maybe I am misreading it. In the book's hand-drawn data-flow figure the location_id does not appear first after each user_id either, but I still think my reading is reasonable.

4.2.2 MapReduce Phase-2: Counting Unique Locations

This phase takes the output of Phase-1 (a sequence of (product_id, location_id) pairs) and generates (product_id, number_of_unique_locations) pairs. The mapper for this phase is an identity mapper, and the reducer counts the number of unique locations per product (by using a Set data structure).

4.2.2.1 Mapper Phase-2: Counting Unique Locations
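A sketch of the Phase-2 mapper, assuming Phase-1 wrote tab-separated product_id and location_id lines through TextOutputFormat:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LocationCountMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // simply forward (product_id, location_id) from Phase-1's output
        String[] tokens = value.toString().split("\t");
        context.write(new Text(tokens[0]), new Text(tokens[1]));
    }
}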

4.2.2.2 Reduce Phase-2: Counting Unique Locations
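A sketch of the Phase-2 reducer, collecting each product's locations into a Set so that duplicates are dropped before counting:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LocationCountReducer extends Reducer<Text, Text, Text, LongWritable> {
    @Override
    protected void reduce(Text productId, Iterable<Text> locations, Context context)
            throws IOException, InterruptedException {
        Set<String> unique = new HashSet<>();
        for (Text location : locations) {
            unique.add(location.toString());
        }
        // emit (product_id, number_of_unique_locations)
        context.write(productId, new LongWritable(unique.size()));
    }
}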

4.2.3 Implementation Classes in Hadoop

Section "4.3 Sample Run" is skipped here.

4.4 Spark Implementation

Since Spark provides a higher-level Java API than the MapReduce/Hadoop API, we can present the whole solution in a single Java class (called LeftOuterJoin), which includes a series of map(), groupBy(), and reduce() functions.



This is how the algorithm works: for users and transactions data we generate (here T2 refers to Tuple2):

users => (userID, T2("L", location))

transactions => (userID, T2("P", product))

Next, we create a union of these data:

all = transactions.union(users)
    = { (userID1, T2("L", L1)),
        (userID1, T2("P", P11)),
        (userID1, T2("P", P12)),
        ...
        (userID1, T2("P", P1n)),
        ...
      }

where L1 is userID1's locationID and P1j is a productID.

The next step is to group data by userID. This will generate:

{

(userID1, List<T2("L", L1), T2("P", P11), T2("P", P12), ...>),

(userID2, List<T2("L", L2), T2("P", P21), T2("P", P22), ...>),

...

}

where Li is a locationID and Pij is a productID.

Spark has a higher-level Java API, so the earlier algorithm can be implemented in a single class. Spark provides a richer API for mappers and reducers: several mappers of different types can coexist without any special plug-in classes, and we will use the JavaRDD.union() function to return the union of the two JavaRDDs (the user RDD and the transaction RDD); the two RDDs being united must have exactly the same type.

4.4.1 Spark Program

As usual, the code is pasted below, step by step.

4.4.2 STEP-0: Import Required Classes

4.4.3 STEP-1: Read Input Parameters

4.4.4 STEP-2: Create JavaSparkContext Object

A JavaSparkContext object is created using the Spark master URL. This object is used to create the first RDD.
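A one-line sketch of this step (the variable name sparkMasterURL and the application name are assumptions; the URL is one of the parameters read in STEP-1):

// sparkMasterURL was read from the command line in STEP-1
JavaSparkContext ctx = new JavaSparkContext(sparkMasterURL, "LeftOuterJoin");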

4.4.5 STEP-3: Create a JavaPairRDD for Users

First, we create a users JavaRDD<String>, where each RDD element is a single record of the text file (representing a userID and a locationID). Next, we use the JavaRDD<String>.mapToPair() function to create a new JavaPairRDD<String, Tuple2<String, String>> whose key is a userID and whose value is a Tuple2("L", location).
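A sketch of STEP-3, assuming a tab-delimited users file (usersInputFile) and Java 8 lambdas rather than the book's anonymous PairFunction classes; ctx and Tuple2 come from STEP-2 and the STEP-0 imports:

JavaRDD<String> users = ctx.textFile(usersInputFile, 1);

JavaPairRDD<String, Tuple2<String, String>> usersRDD =
        users.mapToPair(record -> {
            // record: user_id<TAB>location_id
            String[] tokens = record.split("\t");
            return new Tuple2<>(tokens[0], new Tuple2<>("L", tokens[1]));
        });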

4.4.6 STEP-4: Create a JavaPairRDD for Transactions

First, we create a transactions JavaRDD<String>, where each RDD element is a single record of the text file (representing a transaction record). Next, we use the JavaRDD<String>.mapToPair() function to create a new JavaPairRDD<String, Tuple2<String, String>> whose key is a userID and whose value is a Tuple2("P", product).
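The corresponding sketch for STEP-4, under the same assumptions (tab-delimited transactionsInputFile):

JavaRDD<String> transactions = ctx.textFile(transactionsInputFile, 1);

JavaPairRDD<String, Tuple2<String, String>> transactionsRDD =
        transactions.mapToPair(record -> {
            // record: transaction_id<TAB>product_id<TAB>user_id<TAB>quantity<TAB>amount
            String[] tokens = record.split("\t");
            return new Tuple2<>(tokens[2], new Tuple2<>("P", tokens[1]));
        });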

4.4.7 STEP-5: Create a union of the RDDs created by STEP-3 and STEP-4

This step creates the union of the two JavaPairRDD<String, Tuple2<String, String>>s. JavaPairRDD.union() requires both RDDs to have exactly the same type.
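A sketch of STEP-5, continuing the program above; because both RDDs are JavaPairRDD<String, Tuple2<String, String>>, the union is legal:

JavaPairRDD<String, Tuple2<String, String>> all = transactionsRDD.union(usersRDD);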



4.4.8 STEP-6: Create a JavaPairRDD(userID, List(T2)) by calling groupBy()

Next, we group the data (created in STEP-5) by userID. This step is accomplished by JavaPairRDD.groupByKey().

Here the ordering of the key-value pairs matches my reading above: within the same user, the location comes first.
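A sketch of STEP-6 (on Spark 1.x and later, groupByKey() yields an Iterable<V> rather than a List<V>):

// collect every tagged value ("L"/location and "P"/product) of each userID
JavaPairRDD<String, Iterable<Tuple2<String, String>>> groupedByUser = all.groupByKey();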

4.4.9 STEP-7: Create productLocationsRDD as a JavaPairRDD(String, String)

This step is accomplished by the JavaPairRDD.flatMapToPair() function, for which we implement the PairFlatMapFunction.call() method (a sketch follows the type summary below). The PairFlatMapFunction works as:

PairFlatMapFunction<T, K, V>

T => Iterable<Tuple2<K, V>>

where in our example T is the input and we create (K, V) pairs as output:

T = Tuple2<String, Iterable<Tuple2<String, String>>>

K = String

V = String
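Under those types, a sketch of STEP-7, written against the Spark 1.x PairFlatMapFunction whose call() returns an Iterable (in Spark 2.x it returns an Iterator, so the final line would be kvList.iterator()); the "UNKNOWN" placeholder for a missing location is an assumption:

JavaPairRDD<String, String> productLocationsRDD =
        groupedByUser.flatMapToPair(
            (Tuple2<String, Iterable<Tuple2<String, String>>> userEntry) -> {
                // find the single "L" entry (the user's location), if any
                String location = "UNKNOWN";
                List<String> products = new ArrayList<>();
                for (Tuple2<String, String> tagged : userEntry._2()) {
                    if ("L".equals(tagged._1())) {
                        location = tagged._2();
                    } else {
                        products.add(tagged._2());
                    }
                }
                // pair every product of this user with that location
                List<Tuple2<String, String>> kvList = new ArrayList<>();
                for (String product : products) {
                    kvList.add(new Tuple2<>(product, location));
                }
                return kvList;
            });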

4.4.10 STEP-8: Find all locations for a product

This step groups the (product, location) RDD pairs by product. We use JavaPairRDD.groupByKey() to accomplish this. The step also does some basic debugging by calling the JavaPairRDD.collect() function.
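A sketch of STEP-8 (collect() pulls the whole RDD back to the driver, so the debug print only makes sense for small test data):

JavaPairRDD<String, Iterable<String>> productByLocations = productLocationsRDD.groupByKey();

// basic debugging: materialize the grouped pairs and print them
List<Tuple2<String, Iterable<String>>> debugOutput = productByLocations.collect();
for (Tuple2<String, Iterable<String>> entry : debugOutput) {
    System.out.println(entry._1() + " -> " + entry._2());
}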

4.4.11 STEP-9: Finalize output by changing the "value"

STEP-8 produced a JavaPairRDD<String, List<String>> object, where the key is a product (as a String) and the value is a List<String>, a list of locations (which might contain duplicates). To remove duplicate elements from a value, we use the JavaPairRDD.mapValues() function. We implement this by converting a List<String> into a Set<String>; note that the keys are not altered. Mapping values is implemented by Function(T, R).call(), where T is the input (a List<String>) and R is the output (a Tuple2<Set<String>, Integer>).
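A sketch of STEP-9 as a lambda (on recent Spark versions the grouped value arrives as an Iterable<String> rather than a List<String>, which does not change the logic):

JavaPairRDD<String, Tuple2<Set<String>, Integer>> productByUniqueLocations =
        productByLocations.mapValues(locations -> {
            // copy into a Set to drop duplicate locations; the keys are untouched
            Set<String> uniqueLocations = new HashSet<>();
            for (String location : locations) {
                uniqueLocations.add(location);
            }
            return new Tuple2<>(uniqueLocations, uniqueLocations.size());
        });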

4.4.12 STEP-10: Print the final result RDD

The final step emits the results by using the JavaPairRDD.collect() method.
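A sketch of STEP-10:

List<Tuple2<String, Tuple2<Set<String>, Integer>>> result = productByUniqueLocations.collect();
for (Tuple2<String, Tuple2<Set<String>, Integer>> entry : result) {
    System.out.println(entry._1() + " : " + entry._2());
}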

Sections "4.4.13 Running Spark Solution" and "4.5 Running Spark on YARN" are skipped here.

4.6 Left Outer Join by Spark's leftOuterJoin()

This section solves the left outer join by using Spark's built-in JavaPairRDD.leftOuterJoin() method. Using Spark's JavaPairRDD.leftOuterJoin() method enables us:

  1. To avoid the costly JavaPairRDD.union() operation between users and transactions.
  2. To avoid introducing custom flags such as "L" for location and "P" for products.
  3. To avoid extra RDD transformations to separate custom flags from each other.

Using the JavaPairRDD.leftOuterJoin() method enables us to produce the result efficiently. The transactionsRDD is the left table and the usersRDD is the right table.

Using Spark's built-in JavaPairRDD.leftOuterJoin() method to solve the problem is both more convenient and more efficient. Note, though, that the joined result still carries the user_id, which has to be dropped afterwards. As before, the code is pasted step by step below.

4.6.1 High-Level Steps

4.6.2 STEP-0: Import required classes and interfaces

4.6.3 STEP-1: Read input parameters

4.6.4 STEP-2: Create Spark's context object

4.6.5 STEP-3: Create RDD for the users data

This step creates usersRDD, which is a set of (userID, location) pairs. The usersRDD represents the "right" table for the left outer join operation.

4.6.6 STEP-4: Create usersRDD: The "Right" Table

This step creates the right table, represented as usersRDD, which contains (K=userID, V=location) pairs built from the users input data.
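A sketch covering STEP-3 and STEP-4 together; the tab-delimited usersInputFile and the lambda style are assumptions, and ctx is the JavaSparkContext from STEP-2:

JavaRDD<String> users = ctx.textFile(usersInputFile, 1);

JavaPairRDD<String, String> usersRDD = users.mapToPair(record -> {
    // record: user_id<TAB>location_id
    String[] tokens = record.split("\t");
    return new Tuple2<>(tokens[0], tokens[1]);   // (userID, location)
});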

4.6.7 STEP-5: Create transactionsRDD for the transactions data

This step creates transactionsRDD, which is a set of (userID, product) pairs. The transactionsRDD represents the "left" table for the left outer join operation.

4.6.8 STEP-6: Create transactionsRDD: The "Left" Table

This step creates the left table, represented as transactionsRDD, which contains (K=userID, V=product) pairs built from the transactions input data.
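The corresponding sketch for STEP-5 and STEP-6:

JavaRDD<String> transactions = ctx.textFile(transactionsInputFile, 1);

JavaPairRDD<String, String> transactionsRDD = transactions.mapToPair(record -> {
    // record: transaction_id<TAB>product_id<TAB>user_id<TAB>quantity<TAB>amount
    String[] tokens = record.split("\t");
    return new Tuple2<>(tokens[2], tokens[1]);   // (userID, product)
});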

4.6.9 STEP-7: Use Spark's built-in JavaPairRDD.leftOuterJoin() method

This is the core step: it performs the left outer join operation by using Spark's JavaPairRDD.leftOuterJoin() method.
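A sketch of the call itself; the location side comes back wrapped in an Optional (Guava's Optional in Spark 1.x, org.apache.spark.api.java.Optional in Spark 2.x), which is absent for a userID with no users record:

// left table: transactionsRDD with (userID, product)
// right table: usersRDD with (userID, location)
JavaPairRDD<String, Tuple2<String, Optional<String>>> joined =
        transactionsRDD.leftOuterJoin(usersRDD);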

4.6.10 STEP-8: Create (product, location) pairs

This step builds another JavaPairRDD containing (K=product, V=location) pairs. Note that the userIDs are dropped entirely, since we are only interested in products and the unique locations of the users who bought them.
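A sketch of STEP-8, dropping the userID and unwrapping the Optional (the "UNKNOWN" placeholder for a missing location is an assumption):

JavaPairRDD<String, String> productLocations = joined.mapToPair(entry -> {
    String product = entry._2()._1();
    Optional<String> location = entry._2()._2();
    return new Tuple2<>(product, location.isPresent() ? location.get() : "UNKNOWN");
});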

4.6.11 STEP-9: Group (K=product, V=location) pairs by K

This step groups the (K=product, V=location) pairs by K. The result is (K, V2), where V2 is a list of locations (which may contain duplicate locations).
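The corresponding call:

JavaPairRDD<String, Iterable<String>> productByLocations = productLocations.groupByKey();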

4.6.12 STEP-10: Create final output (K=product, V=Set(location))

This final step removes duplicate locations and creates (K, V2), where V2 is a Tuple2<Set<location>, size>.
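A sketch of STEP-10, mirroring STEP-9 of the union-based solution; printing via collect() stands in for whatever output the program actually writes:

JavaPairRDD<String, Tuple2<Set<String>, Integer>> result =
        productByLocations.mapValues(locations -> {
            Set<String> uniqueLocations = new HashSet<>();
            for (String location : locations) {
                uniqueLocations.add(location);
            }
            return new Tuple2<>(uniqueLocations, uniqueLocations.size());
        });

System.out.println(result.collect());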

Section "4.6.13 Sample Run by YARN" is skipped here.

To wrap up: because the implementation code was pasted in full, the post looks long, but there is not actually that much content. The left outer join discussed in this chapter was covered to some extent in my database course, so the concept itself is easy to understand; the difference is that here the data is big data, for which SQL alone is not a good fit, and the chapter discusses how to solve the left outer join with Hadoop and Spark. The book presents three ways to solve the example, although the first two do not feel like general-purpose methods. The example has two tables: one records user IDs and location IDs, the other records user IDs, product IDs, and some other business fields, and the task is to find out where each product was sold, i.e., to output (product_id, location_id) pairs. The obvious approach for big data is to split the input into blocks handled by many mappers, each extracting the information we need: (user_id, location_id) from the users table and (user_id, product_id) from the transactions table. Both outputs go to the reducers, which join them on user_id as a left outer join and emit the result. However, if the mapper outputs are left unprocessed, the key-value pairs reaching a reducer have no internal structure beyond being sorted by key, so we cannot tell whether the value following a user_id is location information or product information. The mapper outputs of the first two methods are therefore not quite that simple: extra fields are added for secondary sorting and for distinguishing locations from products, so that the pairs reaching the reducer are not only sorted by user_id but also have the location as the first value for each user_id; everything until the next user_id is then a product sold at that location, and emitting the result becomes easy. I call these methods non-general because a user_id here corresponds to exactly one location — what if there were more than one? The last method uses Spark's leftOuterJoin(); after the join we only need to drop the user_id and tidy the data up by product_id before emitting the output. Because the left outer join is done by a built-in method rather than written specifically for this case, it feels like the general method. As for the concrete implementation code, I pasted it and read along, even though I could not have written it from scratch. The sample runs of course come out correct, so there seemed little point in pasting them. That is all.
