2.1 Introduction

MapReduce framework sorts input to reducers by key, but values of reducers are arbitrarily ordered. This means that if all mappers generated the following (key-value) pairs for key = K: (K, V1), (K, V2), ..., (K, Vn). Then all these values {V1, V2, ..., Vn} will be processed by a single reducer (for key = K), but there is no order (ascending or descending) between Vi‘s. Secondary sorting is a design pattern which will put some kind or ordering (such as "ascending sort" or "descending sort") among the values Vi‘s. How do we accomplish this? That is we want to have some order between the reducer values:
S1<=S2<=...<=Sn
or
S1>=S2>=...>=Sn
where Si∈{V1, V2, ..., Vn} for i = {1, 2, ..., n}. Note that each Vi might be a simple data type such as String or Integer or a tuple (more than a single value – a composite object).

仍然在说“二次排序”问题,简单来说就是字面意思的排两次序嘛。在 MapReduce 框架里,mappers 输出的键值对会按照键排序,然后作为 reducers 的输入。于是,具有同一键值的键值对会被送到同一个 reducer 分析,但是键值对里的值则是随机的。然后,我们希望这些值之间也能有一定顺序,即通过所谓“二次排序”来实现。至于为什么这么希望,倒是没有详细说,不过想想第一章那个从气象数据找年最高温的例子,mappers 从数据中提取出键值对(年份,气温),要是值(气温)再按序拍好传给 reducer ,那问题好像会简单点。所以,大概是某些时候这样方式会更优吧。还有要明确的是键值对里的类型可以是复合的,之前例子就有 value 是 tuple(time, value)的情况。

There are two ways to have sorted values for reducer values:

  • Solution-1: Buffer reducer values in memory, then sort. If the number of reducer values are small enough, so they can fit in memory (perreducer), then this solution will work. But if the the number of reducer values are high, then these values might not fit in memory (not a preferable solution). Implementation of this solution is trivial and will not be discussed in this chapter.
  • Solution-2: Use “secondary sorting” design pattern of MapReduce framework and reducer values will arrive sorted to a reducer (no need to sort values in memory). This technique uses the shuffle and sort technique of MapReduce framework to perform sorting of reducer
    values. This technique is preferable to Solution-1 because you do not depend on the memory for sorting (and if you have too many values, then Solution1 might not be a viable option). The rest of this chapter will focus on presenting Solution-2. We present
    implementation of Solution-2 in Hadoop by using
  • Old Hadoop API
    (using org.apache.hadoop.mapred.JobConf and org.apache.hadoop.mapred.*);
    I intentionally included Hadoop‘s old API if in case you are using an old API and have not migrated to new Hadoop API.
  • New Hadoop API (using org.apache.hadoop.mapreduce.Job and
    org.apache.hadoop.mapreduce.lib.*)

同样是上一章说到的两个解决办法,这一章打算更加具体地讲如何用 Hadoop 实现第二种方法,还很贴心的新旧 API 都讲(我感觉就是...跳过)。第二种方法的原理细节,像“This technique uses the shuffle and sort technique of MapReduce framework to perform sorting of reducer values.”仍然不是很懂。这个方法不用 reducer 来二排,传到 reducer 的键值对最后组起来就是二排好的,不明觉厉。

2.2 Secondary Sorting Technique

Let’s have the following values for key = K:
(K, V1), (K, V2),..., (K, Vn).
and further assume that each Vi is a tuple of m attributes as:
(ai1, ai2,..., aim).
where we want to sort reducer’s tuple values by ai1. We will denote (ai2, ..., aim) (the remaining attributes) by r. Therefore, we can express reducer values as:
(K, (a1, r1)), (K, (a2, r2)),..., (K, (an, rn)).
To sort the reducer values by ai, we create a composite key: (K, ai). Our new mappers will emit the following (key, value) pairs for key = K.

Key Value
(K, a1) (a1, r1)
(K, a2) (a2, r2)
... ...
(K, an) (an, rn)

So the "composite key" is (K, ai) and the "natural key" is K. Defining the composite key (by adding the attribute ai to the "natural key" where the values will be sorted on) enables us to sort the reducer values by the MapReduce framework, but when we want to partition keys, we will partition it by the "natural key"
(as K). Below "composite key" and "natural key" are presented visually in 2.2.

把你二排需要用到的值和“natural key”一起定义成“compisite key”, MapReduce 框架就可以实现二排。好像又很简单的样子,不过那么多 mapper ,最后结果是不是也要处理。不过实际上,你也需要写不少东西(如下)告诉框架怎么来排。

Since we defined a "composite key" (composed of "natural key" (as K) and an attribute (as ai) where the reducer values will be sorted on), we have to tell the MapReduce framework how to sort the keys by usinga "composite key" (comprised of two fields: K and ai): for this we need to define a plug-in sort class, CompositeKeyComparator, which will be sorting the Composite Keys. This is how you plug-in this comparator class to a MapReduce framework:

The CompositeKeyComparator class is telling to the MapReduce framework how to sort the composite keys (comprised of two fields: K and ai). The implementation is provided below, which compares two WriteableComparables objects (representing a CompositeKey object).


The next piece of plug-in class is a "natural key partitioner" class (let’s call this NaturalKeyPartitioner class), which will implement the org.apache.hadoop.mapred.Partitio interface. This is how we plug-in the class to the MapReduce framework:

Next, we define the Natural Key Partitioner class:

The last piece to plugin is NaturalKeyGroupingComparator, which considers the natural key. This class just compares two natural keys. This is how you plug-in the class to the MapReduce framework:

This is how you define the NaturalKeyGroupingComparator class:

往框架里写了不少东西,有CompositeKeyComparator来告诉框架怎么对自己定义的“composite key”进行排序,这个可以理解(当然不是代码层面上的)。不过还有NaturalKeyPartitioner用来提取“natural key”以及NaturalKeyGroupingComparator用来按“natural key”进行分组。所以是你给框架新写了个类来告诉它怎么按“composite key”排序,原来的按键排序输出给覆盖了,还要自己把“natural key”提取出来再分个组,然后就可以实现“二排”(纯属个人猜测)。

2.3 Complete Example of Secondary Sorting

2.3.1 Problem Statement

Consider the following data:
Stock-Symbol Date Closed-Price
and assume that we want to generate the following output data per stock-symbol:
Stock-Symbol: (Date1, Price1)(Date2, Price2)...(Daten, Pricen)
where
Date1<=Date2<=...<=Daten.
That is we want the reducer values to be sorted by the date of closed price. This can be accomplished by "secondary sorting".

又举了一个完整的例子来说明二排,想要的输出是数据按 “Stock-Symbol” 分类,并且每类元素按 “Date” 升序。

2.3.2 Input Format

We assume that input data is in CSV (Comma-separated Value) format:
Stock-Symbol,Date,Colsed-Price
for example:
ILMN,2013-12-05,97.65
GOOG,2013-12-09,1078.14
IBM,2013-12-09,177.46
ILMN,2013-12-09,101.33
ILMN,2013-12-06,99.25
GOOG,2013-12-06,1069.87
IBM,2013-12-06,177.67
GOOG,2013-12-05,1057.34

输入的格式是 CSV (用逗号分隔的值)。

2.3.3 Output Format

We want our output to be sorted by "date of closed price": for our sample input, our desired output is listed below:
ILMN: (2013-12-05,97.65)(2013-12-06,99.25)(2013-12-09,101.33)
GOOG: (2013-12-05,1057.34)(2013-12-06,1069.87)(2013-12-09,1078.14)
IBM: (2013-12-06,177.67)(2013-12-09,177.46)

2.3.4 Composite Key

2.3.4.1 Composite Key Definition

The Composite Key Definition is implemented as a CompositeKey class, which implements the WritableComparable<CompositeKey> interface.

2.3.4.1 Composite Key Comparator Definiton

Composite Key Comparator Definition is implemented by the CompositeKeyComparator class which compares two CompositeKey objects by implementaing the compare() method. The compare() method returns 0 if they are identical, returns -1 if the first composite key is smaller than the second one, otherwise returns +1.

2.3.5 Sample Run

2.3.5.1 Implementation Classes using Old Hadoop API

跳过实际运行的“2.3.5.2 Input,2.3.5.3 Running MapReduce Job以及2.3.5.4 Output”。

2.4 Secondary Sorting using New Hadoop API

2.4.0.5 Implementation Classes using New API


WritableComparable(s) can be compared to each other, typically via Comparator(s). Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.

同样跳过“2.4.0.6 Input,2.4.0.7 Running MapReduce Job以及2.4.0.8 Output of MapRedue Job”。

第二章就像名字一样,又举了个更加详细的例子来说明解决“二次排序”问题的第二个方法怎么用 Hadoop 来实现,大概是又加深了一点对其原理的理解,这也是我的主要目的。至于其具体的实现,代码和新旧 API 是蛮贴蛮看,实际还差的远,自己是打不出来的。书上的实际运行,因为觉得对现在没什么必要,也就没有贴。不过稍微看了下,大概就是直接运行run.sh这个脚本,然后检查输入输出文件用的是cat,只有执行时的 Log 比较迷。问题不大,继续往下。

Chapter 2 Secondary Sorting:Detailed Example的更多相关文章

  1. Chapter 1 Secondary Sorting:Introduction

    开始学习<数据算法:Hadoop/Spark大数据处理技巧>第1-5章,假期有空就摘抄下来,毕竟不是纸质的可以写写画画,感觉这样效果好点,当然复杂的东西仍然跳过.写博客越发成了做笔记的感觉 ...

  2. MongoDB系列四:解决secondary的读操作

    http://blog.itpub.net/26812308/viewspace-2124660/ 在Replica sets 中的secondary节点默认是不可读的.使用Replica Sets实 ...

  3. TestLink学习一:Windows搭建Apache+MySQL+PHP环境

    PHP集成开发环境有很多,如XAMPP.AppServ......只要一键安装就把PHP环境给搭建好了.但这种安装方式不够灵活,软件的自由组合不方便,同时也不利于学习.所以我还是喜欢手工搭建PHP开发 ...

  4. MongoDB副本集学习(二):基本测试与应用

    简单副本集测试 这一节主要对上一节搭建的副本集做一些简单的测试. 我们首先进入primary节点(37017),并向test.test集合里插入10W条数据: . rs0:PRIMARY> ;i ...

  5. QAction类详解:

    先贴一段描述:Qt文档原文: Detailed Description The QAction class provides an abstract user interface action tha ...

  6. Union - Find 、 Adjacency list 、 Topological sorting Template

    Find Function Optimization: After Path compression: int find(int x){ return root[x] == x ? x : (root ...

  7. Java编程思想总结笔记The first chapter

    总觉得书中太啰嗦,看完总结后方便日后回忆,本想偷懒网上找别人的总结,无奈找不到好的,只好自食其力,尽量总结得最好. 第一章  对象导论 看到对象导论觉得这本书 目录: 1.1 抽象过程1.2 每个对象 ...

  8. centos6上yum安装drbd(内核:2.6.32.696)

    author:headsen  chen date: 2017-11-20  15:11:21 notice: 个人原创,转载请注明,否则依法追究法律责任 前期准备: 两台机器:配置主机名分别为: l ...

  9. Mysql相关知识点梳理(一):优化查询

    EXPLAIN解析SELECT语句执行计划: EXPLAIN与DESC同义,通过它可解析MySQL如何处理SELECT,提供有关表如何联接和联接的次序,还可以知道什么时候必须为表加入索引以得到一个使用 ...

随机推荐

  1. Group by、having、order by、Distinct 使用注意事项

    直奔主题,如下SQL语句 SELECT COUNT(*) AS COUNT,REQUEST,METHOD FROM REQUESTMETH GROUP BY REQUEST,METHOD HAVING ...

  2. SharePoint 2013的REST编程基础

    1. SharePoint 2013对REST编程的支持 自从SharePoint2013开始, SharePoint开始了对REST 编程的支持,这样除了.NET , Silverlight, Po ...

  3. Hudson-ci/Installing Hudson Windows Service---官方文档

    < Hudson-ci Hudson Continuous Integration Server Website Download Community Mailing List • Forums ...

  4. [PY3]——内置数据结构(7)——字典及其常用操作

    字典及其常用操作Xmind图 关于字典 字典是一种key-value结构 字典是无序的 字典的定义 # {}大括号可以直接定义一个空字典 In [1]: d={};type(d) Out[1]: di ...

  5. vue中添加Echarts图表的使用,Echarts的学习笔记

    项目中需要使用一些折线图.柱状图.饼状图等等,之前使用过heightCharts(关于heightCharts请看我的另一篇 http://www.cnblogs.com/jasonwang2y60/ ...

  6. JavaScript中有对字符串编码的三个函数:escape,encodeURI,encodeURIComponent

    JavaScript中有三个可以对字符串编码的函数,分别是: escape,encodeURI,encodeURIComponent,相应3个解码函数:unescape,decodeURI,decod ...

  7. 整理一些前端开发比较好的node包

    1.一个base64转换的包 !https://www.npmjs.com/package/js-base64 2 js 压缩 !https://www.npmjs.com/package/uglif ...

  8. 撩课-Java每天10道面试题第5天

    41.Iterator.ListIterator 和 Enumeration的区别? 迭代器是一种设计模式, 它是一个对象, 它可以遍历并选择序列中的对象, 而开发人员不需要了解 该序列的底层结构. ...

  9. redis 学习(一)

    一.Redis概述 1.什么是NoSql NoSQL(NoSQL = Not Only SQL ),意即“不仅仅是SQL”,它泛指非关系型的数据库. 随着互联网2003年之后web2.0网站的兴起,传 ...

  10. hdu 2199 Can you solve this equation? 二分

    Can you solve this equation? Time Limit: 2000/1000 MS (Java/Others)    Memory Limit: 32768/32768 K ( ...