2.1 Introduction

The MapReduce framework sorts the input to reducers by key, but the values associated with each key arrive in arbitrary order. This means that if the mappers generate the (key, value) pairs (K, V1), (K, V2), ..., (K, Vn) for key = K, then all of the values {V1, V2, ..., Vn} will be processed by a single reducer (for key = K), but there is no ordering (ascending or descending) among the Vi's. Secondary sorting is a design pattern that imposes some kind of ordering (such as an ascending or descending sort) on the values Vi. How do we accomplish this? That is, we want some order among the reducer values:
S1<=S2<=...<=Sn
or
S1>=S2>=...>=Sn
where Si∈{V1, V2, ..., Vn} for i = 1, 2, ..., n. Note that each Vi might be a simple type such as a String or an Integer, or a tuple (more than a single value, i.e. a composite object).

Still on the "secondary sorting" problem, which, put simply, is literally sorting twice. In the MapReduce framework, the key-value pairs produced by the mappers are sorted by key and then fed to the reducers, so pairs with the same key all go to the same reducer for processing, but the values inside those pairs arrive in random order. We would like those values to have some ordering too, which is what "secondary sorting" provides. The book does not explain in detail why we would want this, but recall the chapter 1 example of finding the yearly maximum temperature from weather data: the mappers extract (year, temperature) pairs, and if the values (temperatures) were already handed to the reducer in sorted order, the problem would seem a bit simpler. So presumably this approach is better in certain situations. One more thing to be clear about: the types inside a key-value pair can be composite; an earlier example already had a value that was a tuple (time, value).

There are two ways to provide sorted values to a reducer:

  • Solution-1: Buffer the reducer values in memory, then sort. If the number of reducer values is small enough to fit in memory (per reducer), this solution will work. But if the number of reducer values is large, they might not fit in memory, so this is not a preferable solution. The implementation of this solution is trivial and will not be discussed in this chapter.
  • Solution-2: Use the "secondary sorting" design pattern of the MapReduce framework, so that the values arrive at the reducer already sorted (there is no need to sort them in memory). This technique uses the shuffle-and-sort phase of the MapReduce framework to sort the reducer values. It is preferable to Solution-1 because it does not depend on memory for sorting (and if there are too many values, Solution-1 might not be a viable option). The rest of this chapter focuses on presenting Solution-2. We present an implementation of Solution-2 in Hadoop using:
  • The old Hadoop API (using org.apache.hadoop.mapred.JobConf and org.apache.hadoop.mapred.*); the old API is included intentionally, in case you are still using it and have not migrated to the new Hadoop API.
  • The new Hadoop API (using org.apache.hadoop.mapreduce.Job and org.apache.hadoop.mapreduce.lib.*).

These are the same two solutions mentioned in the previous chapter; this chapter sets out to explain more concretely how to implement the second one in Hadoop, and thoughtfully covers both the old and the new API (which, honestly, I'm inclined to just skim). I still don't fully understand the detailed mechanics of the second approach, e.g. "This technique uses the shuffle and sort technique of MapReduce framework to perform sorting of reducer values." The method does not use the reducer to do the second sort; the key-value pairs delivered to the reducer already come out secondarily sorted. Impressive, even if I don't quite see how.

2.2 Secondary Sorting Technique

Let’s have the following values for key = K:
(K, V1), (K, V2),..., (K, Vn).
and further assume that each Vi is a tuple of m attributes as:
(ai1, ai2,..., aim).
where we want to sort the reducer's tuple values by the first attribute, ai1 (written simply as ai below). We will denote the remaining attributes (ai2, ..., aim) of Vi by ri. Therefore, we can express the reducer values as:
(K, (a1, r1)), (K, (a2, r2)),..., (K, (an, rn)).
To sort the reducer values by ai, we create a composite key: (K, ai). Our new mappers will emit the following (key, value) pairs for key = K:

Key        Value
(K, a1)    (a1, r1)
(K, a2)    (a2, r2)
...        ...
(K, an)    (an, rn)

So the "composite key" is (K, ai) and the "natural key" is K. Defining the composite key (by adding to the "natural key" the attribute ai on which the values will be sorted) lets the MapReduce framework sort the reducer values for us; when it comes to partitioning, however, we partition by the "natural key" (K) alone. The "composite key" and the "natural key" are shown visually in Figure 2-2 of the book.

Define the value you need for the secondary sort together with the "natural key" as a "composite key", and the MapReduce framework can do the secondary sort for you. It looks simple again, though with so many mappers I wonder whether the final results still need extra handling. In practice you still have to write quite a bit of code (as follows) to tell the framework how to do the sorting.

Since we defined a "composite key" (composed of the "natural key" K and the attribute ai on which the reducer values will be sorted), we have to tell the MapReduce framework how to sort by this composite key (comprised of two fields: K and ai). For this we define a plug-in sort class, CompositeKeyComparator, which sorts the composite keys. This is how you plug this comparator class into the MapReduce framework:
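The book's actual snippet is not pasted into these notes. As a minimal sketch using the old Hadoop API (SecondarySortDriver is a made-up driver class name, and CompositeKey is the key class sketched later in Section 2.3.4), the wiring would look roughly like this:

```java
import org.apache.hadoop.mapred.JobConf;

public class SecondarySortDriver {   // hypothetical driver class name

    public static JobConf buildJobConf() {
        JobConf conf = new JobConf(SecondarySortDriver.class);
        // the mappers emit the composite key (K, ai), so it is the map output key class
        conf.setMapOutputKeyClass(CompositeKey.class);
        // plug in the comparator that sorts the composite keys
        conf.setOutputKeyComparatorClass(CompositeKeyComparator.class);
        return conf;
    }
}
```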

The CompositeKeyComparator class tells the MapReduce framework how to sort the composite keys (comprised of two fields: K and ai). The implementation, provided below, compares two WritableComparable objects (each representing a CompositeKey).
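The implementation itself is also not pasted here; a minimal sketch, assuming CompositeKey exposes hypothetical accessors getNaturalKey() (the K part, a String) and getAttribute() (the ai part, taken here to be a long), could look like this:

```java
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class CompositeKeyComparator extends WritableComparator {

    protected CompositeKeyComparator() {
        // "true" asks WritableComparator to deserialize keys, so compare() sees CompositeKey objects
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        CompositeKey k1 = (CompositeKey) w1;
        CompositeKey k2 = (CompositeKey) w2;
        // order first by the natural key K ...
        int result = k1.getNaturalKey().compareTo(k2.getNaturalKey());
        if (result == 0) {
            // ... then by the attribute ai (ascending secondary sort)
            result = Long.compare(k1.getAttribute(), k2.getAttribute());
        }
        return result;
    }
}
```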


The next plug-in class is a "natural key partitioner" class (let's call it NaturalKeyPartitioner), which implements the org.apache.hadoop.mapred.Partitioner interface. This is how we plug the class into the MapReduce framework:
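Again the book's snippet is skipped; continuing the buildJobConf() sketch above, the call would plausibly be:

```java
// added to the JobConf setup sketched above (old Hadoop API)
conf.setPartitionerClass(NaturalKeyPartitioner.class);  // partition map output by the natural key K only
```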

Next, we define the Natural Key Partitioner class:
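A hedged sketch of such a partitioner against the old API, assuming the map output value is a Text and the same hypothetical getNaturalKey() accessor as above:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class NaturalKeyPartitioner implements Partitioner<CompositeKey, Text> {

    @Override
    public void configure(JobConf job) {
        // no job-specific configuration is needed
    }

    @Override
    public int getPartition(CompositeKey key, Text value, int numPartitions) {
        // partition by the natural key K only, so every (K, *) pair lands on the same reducer
        return (key.getNaturalKey().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```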

The last piece to plug in is NaturalKeyGroupingComparator, which considers only the natural key. This class just compares two natural keys. This is how you plug the class into the MapReduce framework:
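Continuing the same JobConf setup sketch, the grouping comparator would be registered roughly like this:

```java
// added to the JobConf setup sketched above (old Hadoop API)
conf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class); // group reducer input by the natural key K
```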

This is how you define the NaturalKeyGroupingComparator class:
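And a matching sketch of the grouping comparator, under the same assumptions about CompositeKey:

```java
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class NaturalKeyGroupingComparator extends WritableComparator {

    protected NaturalKeyGroupingComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        // group reducer input by the natural key K only, ignoring the sort attribute ai
        CompositeKey k1 = (CompositeKey) w1;
        CompositeKey k2 = (CompositeKey) w2;
        return k1.getNaturalKey().compareTo(k2.getNaturalKey());
    }
}
```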

We wrote quite a few things into the framework: CompositeKeyComparator tells it how to sort our custom "composite key", which I can understand (not at the code level, of course). But there are also NaturalKeyPartitioner, which extracts the "natural key" for partitioning, and NaturalKeyGroupingComparator, which groups by the "natural key". So: you give the framework a new class telling it how to sort by the "composite key", which replaces the default sort-by-key behavior, and then you extract the "natural key" yourself to partition and group the records, and that achieves the "secondary sort" (pure personal guess).

2.3 Complete Example of Secondary Sorting

2.3.1 Problem Statement

Consider the following data:
Stock-Symbol Date Closed-Price
and assume that we want to generate the following output data per stock-symbol:
Stock-Symbol: (Date1, Price1)(Date2, Price2)...(Daten, Pricen)
where
Date1<=Date2<=...<=Daten.
That is, we want the reducer values to be sorted by the date of the closing price. This can be accomplished by "secondary sorting".

Another complete example to illustrate secondary sorting: the desired output groups the data by "Stock-Symbol", with the entries of each group in ascending order of "Date".

2.3.2 Input Format

We assume that the input data is in CSV (comma-separated values) format:
Stock-Symbol,Date,Closed-Price
for example:
ILMN,2013-12-05,97.65
GOOG,2013-12-09,1078.14
IBM,2013-12-09,177.46
ILMN,2013-12-09,101.33
ILMN,2013-12-06,99.25
GOOG,2013-12-06,1069.87
IBM,2013-12-06,177.67
GOOG,2013-12-05,1057.34

The input format is CSV (comma-separated values).

2.3.3 Output Format

We want our output to be sorted by the date of the closing price; for our sample input, the desired output is listed below:
ILMN: (2013-12-05,97.65)(2013-12-06,99.25)(2013-12-09,101.33)
GOOG: (2013-12-05,1057.34)(2013-12-06,1069.87)(2013-12-09,1078.14)
IBM: (2013-12-06,177.67)(2013-12-09,177.46)

2.3.4 Composite Key

2.3.4.1 Composite Key Definition

The composite key is implemented as a CompositeKey class, which implements the WritableComparable<CompositeKey> interface.
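The class itself isn't pasted into these notes. A minimal sketch for the stock example, assuming the composite key holds the stock symbol (the natural key) and the date as a yyyy-MM-dd string (which sorts correctly as plain text); the field and accessor names are mine, not necessarily the book's, and getStockSymbol()/getDate() here play the role of the generic getNaturalKey()/getAttribute() placeholders used in the Section 2.2 sketches:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class CompositeKey implements WritableComparable<CompositeKey> {

    private String stockSymbol;  // the natural key K
    private String date;         // the attribute the values are sorted on, as "yyyy-MM-dd"

    public CompositeKey() {
        // empty constructor required by the Writable machinery
    }

    public CompositeKey(String stockSymbol, String date) {
        this.stockSymbol = stockSymbol;
        this.date = date;
    }

    public String getStockSymbol() {
        return stockSymbol;
    }

    public String getDate() {
        return date;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(stockSymbol);
        out.writeUTF(date);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        stockSymbol = in.readUTF();
        date = in.readUTF();
    }

    @Override
    public int compareTo(CompositeKey other) {
        // order by stock symbol first, then by date (ascending)
        int result = this.stockSymbol.compareTo(other.stockSymbol);
        if (result == 0) {
            result = this.date.compareTo(other.date);
        }
        return result;
    }
}
```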

2.3.4.2 Composite Key Comparator Definition

The composite key comparator is implemented by the CompositeKeyComparator class, which compares two CompositeKey objects by implementing the compare() method. The compare() method returns 0 if they are identical, -1 if the first composite key is smaller than the second one, and +1 otherwise.
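This mirrors the CompositeKeyComparator sketched in Section 2.2, just with the concrete accessors of the stock-example CompositeKey and an explicit -1/0/+1 result; only the compare() method changes:

```java
@Override
public int compare(WritableComparable w1, WritableComparable w2) {
    CompositeKey k1 = (CompositeKey) w1;
    CompositeKey k2 = (CompositeKey) w2;
    // compare by stock symbol first, then by date; return exactly -1, 0 or +1
    int result = k1.getStockSymbol().compareTo(k2.getStockSymbol());
    if (result == 0) {
        result = k1.getDate().compareTo(k2.getDate());
    }
    return Integer.signum(result);
}
```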

2.3.5 Sample Run

2.3.5.1 Implementation Classes using Old Hadoop API
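The book's listing of implementation classes is not copied here either. As an illustrative sketch under the same assumptions (old Hadoop API, the CompositeKey from Section 2.3.4 holding a stock symbol and a yyyy-MM-dd date string; SecondarySortMapper and SecondarySortReducer are made-up class names), the mapper emits the composite key and a "date,price" value:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Mapper: turns a CSV line "Stock-Symbol,Date,Closed-Price" into
// key = (symbol, date) and value = "date,price".
public class SecondarySortMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, CompositeKey, Text> {

    @Override
    public void map(LongWritable offset, Text line,
                    OutputCollector<CompositeKey, Text> output, Reporter reporter)
            throws IOException {
        String[] tokens = line.toString().split(",");
        String symbol = tokens[0];
        String date = tokens[1];
        String price = tokens[2];
        output.collect(new CompositeKey(symbol, date), new Text(date + "," + price));
    }
}
```

and the reducer concatenates the already-sorted (date, price) pairs per stock symbol:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Reducer: thanks to the composite-key sort and the natural-key grouping,
// the values for one stock symbol arrive already ordered by date.
public class SecondarySortReducer extends MapReduceBase
        implements Reducer<CompositeKey, Text, Text, Text> {

    @Override
    public void reduce(CompositeKey key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        StringBuilder sorted = new StringBuilder();
        while (values.hasNext()) {
            // each value is "date,price"; emit it as "(date,price)"
            String[] dateAndPrice = values.next().toString().split(",");
            sorted.append("(").append(dateAndPrice[0]).append(",").append(dateAndPrice[1]).append(")");
        }
        output.collect(new Text(key.getStockSymbol()), new Text(sorted.toString()));
    }
}
```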

Skipping the actual-run sections "2.3.5.2 Input", "2.3.5.3 Running MapReduce Job", and "2.3.5.4 Output".

2.4 Secondary Sorting using New Hadoop API

2.4.0.5 Implementation Classes using New API


WritableComparable(s) can be compared to each other, typically via Comparator(s). Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.
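The new-API listings are skipped in these notes as well; as a rough sketch of how the same three plug-ins would be wired through org.apache.hadoop.mapreduce.Job (assuming the helper classes have been ported to the new API, e.g. a NaturalKeyPartitioner extending org.apache.hadoop.mapreduce.Partitioner; SecondarySortNewApiDriver is a made-up class name):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SecondarySortNewApiDriver {   // hypothetical driver class name

    public static Job buildJob() throws Exception {
        // the same three plug-ins as before, configured on a Job instead of a JobConf
        Job job = Job.getInstance(new Configuration(), "secondary sort");
        job.setMapOutputKeyClass(CompositeKey.class);
        job.setSortComparatorClass(CompositeKeyComparator.class);           // sort by the composite key
        job.setPartitionerClass(NaturalKeyPartitioner.class);               // partition by the natural key K
        job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class); // group reducer input by K
        return job;
    }
}
```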

Likewise skipping "2.4.0.6 Input", "2.4.0.7 Running MapReduce Job", and "2.4.0.8 Output of MapReduce Job".

Chapter 2 is exactly what its title says: a more detailed example of how to implement the second solution to the "secondary sorting" problem in Hadoop. It probably deepened my understanding of the underlying principle a bit, which was my main goal. As for the concrete implementation, the code and the old/new APIs are things I can paste and follow along with, but I am still far from being able to write them myself. I did not paste the actual runs from the book, since they don't seem necessary right now; from a quick look, it is basically running the run.sh script and checking the input and output files with cat, and only the log produced during execution is a bit puzzling. Not a big deal, moving on.
