http://blog.csdn.net/pipisorry/article/details/48443533

Mining Massive Datasets (MMDS), Jure Leskovec's course: study notes on MapReduce

{A programming system for easily implementing parallel algorithms on commodity clusters.}

Distributed File Systems (DFS)

Why do we need MapReduce in the first place?

Classical data mining: look at the disk in addition to the CPU and memory. The data sits on disk, so you can only bring a portion of it into memory at a time, process it in batches, and write the results back to disk.

But for a dataset as huge as Google's, reading it in piece by piece like this still takes an unacceptably long time.

Machine learning, statistics: the obvious idea is to split the data into chunks. With multiple disks and CPUs, you stripe the data across the disks and then read and process it in parallel on multiple CPUs.

This is the fundamental idea of cluster computing.

Cluster computing architecture

Note: a higher-bandwidth switch can carry 2-10 gigabits per second between racks. A rack holds 16 to 64 nodes, and racking up multiple racks gives you a data center. This is the standard architecture that has emerged over the last few years.

Challenges of cluster computing

Challenge 1: node failures

Note: persistence means that once you store the data, you are guaranteed to be able to read it again.

So we need an infrastructure that can hide these node failures and let the computation run to completion even when nodes fail.

Challenge 2: network bandwidth bottleneck

Note: a complex computation might need to move a lot of data, and that can slow the computation down. So you need a framework that does not move data around so much while it is computing.

Challenge 3: distributed programming is hard

How MapReduce and the DFS address these problems

MapReduce addresses all three challenges above.

Redundant storage infrastructure



Note:

1. A distributed file system stores data across a cluster, and stores each piece of data multiple times. This addresses challenge 1.

2. The data is rarely updated in place; it is almost always updated through appends.

Note:

1. A file is split into multiple chunks, which are stored redundantly on different chunk servers (usually 3 replicas each).

2. The machines themselves are called chunk servers in this context.

3. Replicas of a chunk are never placed on the same chunk server.

Chunk servers & compute servers

Bring computation to the data: chunk servers also act as compute servers. Whenever a computation has to access data, it is scheduled on the chunk server that actually holds that data. This way you avoid moving data to where the computation needs to run. This addresses challenge 2.

The three components of a distributed file system

Note:

1. Keep at least one replica in an entirely different rack if possible. The most common scenario is that a single node fails, but it is also possible that the switch on a rack fails, and when the switch fails the entire rack becomes inaccessible.

2. When a client, or an algorithm that needs the data, tries to access a file, it goes through the client library. ... Once that is done, the client talks directly to the chunk servers, where it can access the data without going through the master node. So data access actually happens in peer-to-peer fashion, not through the master node.

皮皮Blog

The MapReduce Computational Model

Warm-up task

Task: word count

Note: you can just build a hash table indexed by word. The first time you see a word, you add an entry to the hash table with that word and set the count to 1; every subsequent time you see the word, you increment the count by one. You make a single sweep through the file.
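A minimal single-machine sketch of this hash-table word count, written in Python (the file name doc.txt is only a placeholder):

    from collections import defaultdict

    def word_count(path):
        counts = defaultdict(int)          # hash table indexed by word
        with open(path) as f:
            for line in f:                 # a single sweep through the file
                for word in line.split():
                    counts[word] += 1      # first sighting sets 1, later sightings increment
        return counts

    print(word_count("doc.txt"))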

Note:

1. You could write some complicated code, but it is easier to use Unix file system commands to do this.

2. Here the command words is a little script that goes through doc.txt and outputs the words in it, one per line. Once you sort that output, all occurrences of the same word come together, and uniq -c takes each run of the same word and counts its occurrences. So the output of words doc.txt | sort | uniq -c is a list of (word, count) pairs.

3. The custom Unix command words(doc_name) goes through the file doc_name and outputs one word per line.

The Unix text utility uniq checks for and removes repeated adjacent lines in a text file.

Syntax: uniq [-cdu][-f<fields>][-s<chars>][-w<chars>][--help][--version][input file][output file]

Options: -c or --count shows, next to each line, how many times it occurred. ...

4. This pipeline is essentially a MapReduce computation (a Python sketch of the sort-and-count step follows).
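A rough Python sketch (not the course's code) of what sort followed by uniq -c does with the words output; the words() helper below is a hypothetical stand-in for the words script:

    import itertools

    def words(path):
        # stand-in for the 'words' script: yield one word at a time
        with open(path) as f:
            for line in f:
                yield from line.split()

    def sort_uniq_c(path):
        # sort brings identical words together; groupby then counts each run, like uniq -c
        for word, run in itertools.groupby(sorted(words(path))):
            print(sum(1 for _ in run), word)

    sort_uniq_c("doc.txt")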

MapReduce overview

The two main steps of MapReduce

Note:

1. The script called words, which outputs one word per line, plays the role of the Map function in MapReduce. The Map function scans the input file one record at a time.

2. From each record it pulls out something you care about; in this case, the words. The things you output are called keys.

3. All occurrences of the same key are then grouped together.

4. The outline of this computation stays the same for any MapReduce computation. What changes is the Map function and the Reduce function, which you adapt to the problem you are actually solving (a generic skeleton is sketched below).
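A minimal sketch of that fixed outline, with only map_fn and reduce_fn left to the programmer. This is a single-machine, in-memory illustration, not the distributed implementation; all names are illustrative:

    import itertools

    def map_reduce(records, map_fn, reduce_fn):
        # Map: apply map_fn to every input record, collecting intermediate (key, value) pairs
        intermediate = [kv for record in records for kv in map_fn(record)]
        # Group: sort by key so that all values for the same key come together
        intermediate.sort(key=lambda kv: kv[0])
        grouped = itertools.groupby(intermediate, key=lambda kv: kv[0])
        # Reduce: apply reduce_fn to each key and its list of values
        return [reduce_fn(key, [v for _, v in group]) for key, group in grouped]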

MapReduce: the Map step

Apply the Map function to each input record, and create intermediate key-value pairs:

Note:

1. The intermediate key-value pairs need not have the same keys as the input key-value pairs, and there can be multiple of them. And although the values all look the same here (they all say v), the values can be different as well.

2. The Map function can produce multiple intermediate key-value pairs: there can be zero, one, or many intermediate key-value pairs for each input key-value pair.

MapReduce: the Reduce step

Group the intermediate key-value pairs by key:

Note: this is done by sorting by key and then grouping the values together. These are all different values, although the same symbol v is used for them here.

Summary of the MapReduce process

MapReduce example 1: word count

Note:

1. This whole example does not run on a single node. The data is actually distributed across multiple input nodes (four here). The Map tasks run on each of those four nodes, so the outputs of the Map tasks are also produced on four different nodes.

2. The system then copies the Map outputs onto a single node; once the data has flowed there, it can sort by key and do the final Reduce step. In practice you use multiple Reduce nodes as well: you can tell the system how many Reduce nodes to use, and it makes sure that for any given key, all instances, regardless of which Map node they started from, always end up at the same Reduce node. This is done with a hash function: the system hashes each Map key and thereby determines the single Reduce node to ship that tuple to, which ensures that all tuples with the same key end up at the same Reduce node.

3. The final output being spread out is perfectly fine, because you are dealing with a distributed file system, which knows that your file is spread across three nodes. You can still access it as a single file from your client, and the system knows to fetch the data from those three independent nodes.

4. All of this MapReduce machinery is implemented to use, as far as possible, only sequential scans of disk rather than random accesses: how the Map function is applied to the input file record by record, how the sorting is done, and so on, can all be implemented using only sequential reads of disk and never random accesses. Sequential reads are much more efficient than random accesses, and the whole MapReduce system is built around doing only sequential reads of files.

Pseudocode for word count
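A hedged Python rendering of the usual word-count pseudocode, reusing the map_reduce skeleton sketched above:

    def wc_map(document):
        # emit (word, 1) for every word in the document
        for word in document.split():
            yield (word, 1)

    def wc_reduce(word, counts):
        # sum up all the partial counts for this word
        return (word, sum(counts))

    docs = ["the quick brown fox", "the lazy dog"]
    print(map_reduce(docs, wc_map, wc_reduce))   # [('brown', 1), ('dog', 1), ..., ('the', 2)]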

MapReduce example 2: total bytes per host

Note: the problem is that for each host (not for each URL) we want to find the total number of bytes; there can be many URLs with the same host name.
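A sketch of the Map and Reduce functions for this task, assuming (purely for illustration) input records of the form (url, size_in_bytes) and reusing the map_reduce skeleton above:

    from urllib.parse import urlparse

    def host_map(record):
        url, size = record
        yield (urlparse(url).hostname, size)   # key on the host, not the full URL

    def host_reduce(host, sizes):
        return (host, sum(sizes))              # total bytes per host

    records = [("http://a.com/x", 100), ("http://a.com/y", 250), ("http://b.com/z", 40)]
    print(map_reduce(records, host_map, host_reduce))   # [('a.com', 350), ('b.com', 40)]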

皮皮Blog

Scheduling and Data Flow

{Go under the hood of a MapReduce system and understand how it actually works.}

MapReduce diagram

Note: walkthrough of the MapReduce diagram:

In the Map step, you take a big document divided into chunks and run a Map process on each chunk. The Map process goes through each record in its chunk and outputs intermediate key-value pairs for each record.

In the group-by step, you group by key, bringing together all the values that share the same key.

In the Reduce step, you apply a reducer to each group of intermediate key-value pairs and produce the final output.

How MapReduce works in a distributed system

{The previous schematic showed how MapReduce works on a centralized system.}

Note:

1. In a distributed system you have multiple nodes, and Map and Reduce tasks run in parallel on those nodes, producing intermediate key-value pairs on each of them. Once the intermediate key-value pairs are produced, the underlying MapReduce system applies a partitioning function, which is just a hash function, to each intermediate key; the hash value tells the system which Reduce node to send that key-value pair to.

2. Once a Reduce task has received its input from all the Map tasks, its first job is to sort that input and group it together by key.

3. Once that is done, the Reduce task runs the Reduce function provided by the programmer on each such group and creates the final output.

The MapReduce environment



Note:

1. The programmer provides two functions, Map and Reduce, and specifies the input file.

2. Scheduling: figuring out where the Map tasks run, where the Reduce tasks run, and so on.

MapReduce Scheduling and Data Flow



Note:

1. "Close" means that the input data is a file divided into chunks, with replicas of the chunks on different chunk servers. The MapReduce system tries to schedule each Map task on a chunk server that holds a copy of the corresponding chunk, so there is no actual data copy associated with the Map step of the program.

2. There is some overhead to storing data in the DFS: multiple replicas of the data have to be made, and network shuffling is involved in writing new data into the distributed file system. So, whenever possible, intermediate results are stored in the local file system of the Map and Reduce workers.

Coordination: the Master node

Note:

1. When a Map task completes, it sends the master the locations and sizes of the R intermediate files it created. Why R intermediate files? One intermediate file is created for each reducer, because the mapper's output has to be shipped to each of the reducers depending on the key. So whenever a Map task completes, it stores its R intermediate files on its local file system and lets the master know the names of those files. The master pushes this information to the reducers.

2. Once the reducers know that all the Map tasks have completed, they copy the intermediate files from each of the Map tasks and then proceed with their work.

Handling worker node failures

Map worker failure

Note:

1. If a Map worker fails, then all the Map tasks that were scheduled on that worker may have failed and must be rescheduled.

2. When a Map worker fails, the node fails, so all intermediate output created by the Map tasks that ran on that worker is lost (it lived only on the local file system); even completed Map tasks therefore have to be redone.

Reduce worker failure

Note: only in-progress Reduce tasks need to be reset to idle. The output of a Reduce worker is final output written to the distributed file system, not to the Reduce worker's local file system, so completed output is not lost even if the Reduce worker fails.

Master failure

Note: if the master fails, the client can do something like restarting the whole MapReduce job. The master is typically not replicated in the MapReduce system.

Rules of thumb for setting the number of Map and Reduce tasks

Note:

1. If there is one Map task per node in the cluster and a node fails during processing, that Map task has to be rescheduled on another node when one becomes available. Since all the other nodes are busy processing, one of them has to finish before the failed Map task can be scheduled there, so the entire computation is slowed down.

2. So, instead of one Map task per node, there are many small Map tasks on each node; if a node fails, its Map tasks can be spread across all the available nodes, and the whole job completes much faster.

3. R is usually even smaller than the total number of nodes in the system. The output file is spread across R nodes, where R is the number of reducers, and it is usually convenient to have the output spread across a small number of nodes rather than a large number.

皮皮Blog

Combiners and Partition Functions

{Refinements on top of how the MapReduce model is actually implemented.}

Refinement: combiners

Note:

1. That is, pairs with the same key are already aggregated during the Map phase, with an aggregation function (the combiner) that is similar to the Reduce function. This reduces the network cost of shipping pairs to the reducers.

2. The programmer provides a combine function. Instead of a whole bunch of tuples with key k being shipped off to a reducer, just a single tuple (k, v2) is shipped off; usually the combiner is the same function as the reducer.

Example with a combiner: word count (sketched below)
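A hedged sketch of how a combiner changes the Map side of word count, reusing wc_map from the earlier sketch; the combiner here is just the reducer's sum applied locally on each Map node:

    import itertools

    def combine(mapper_output):
        # run locally on one Map node: pre-aggregate the counts for each word
        pairs = sorted(mapper_output, key=lambda kv: kv[0])
        for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
            yield (word, sum(v for _, v in group))   # ship one tuple per word instead of many

    local_pairs = list(wc_map("the quick the lazy the"))
    print(list(combine(local_pairs)))   # [('lazy', 1), ('quick', 1), ('the', 3)]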

Properties the combiner / Reduce function must satisfy

Note:

1. The Reduce function must be commutative and associative, like sum. Because sum satisfies both properties, it can be used as a combiner as well as a reducer.

2. Average does not satisfy these properties, so it cannot be used directly as a combiner.

But average can still be computed this way: the combiner returns (sum, count) pairs, and the Reduce function combines them and computes the average at the end (see the sketch below).

It is sometimes possible to take a function that is not commutative or associative, break it down into functions that are, like sum and count, and still use the combiner trick to save some network traffic.
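A sketch of that decomposition for averaging the values of a key (names are illustrative, not the course's code); the reducer receives the partial (sum, count) pairs emitted by the combiners:

    def avg_combine(values):
        # run on each Map node: collapse a list of numbers into one (sum, count) pair
        return (sum(values), len(values))

    def avg_reduce(key, partials):
        # partials is a list of (sum, count) pairs coming from different Map nodes
        total = sum(s for s, _ in partials)
        count = sum(c for _, c in partials)
        return (key, total / count)

    print(avg_reduce("k", [avg_combine([1, 2, 3]), avg_combine([4, 5])]))   # ('k', 3.0)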

An example where a combiner cannot be used: the median

It turns out, and can be proved mathematically, that there is no way to split the median into commutative and associative computations. So you cannot use the combiner trick; you just have to ship all the values to the reducer and compute the median there.

Refinement: the partition function

First recall the MapReduce infrastructure: it applies a hash function to each key in the intermediate key-value set, and this hash function decides which Reduce node that key gets shipped to.

The user can override this hash function.

Note: this custom partition function sends URLs belonging to the same host to the same Reduce node, rather than routing each distinct URL independently (a sketch follows).
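A hedged sketch of such a custom partition function; the default version hashes the whole key, while the custom one hashes only the host part of a URL key (num_reduce_tasks is a placeholder):

    from urllib.parse import urlparse

    def default_partition(key, num_reduce_tasks):
        return hash(key) % num_reduce_tasks                  # default: hash the whole key

    def host_partition(url, num_reduce_tasks):
        # custom: all URLs on the same host land on the same Reduce node
        return hash(urlparse(url).hostname) % num_reduce_tasks

    print(host_partition("http://a.com/x", 4) == host_partition("http://a.com/y", 4))   # True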

Implementations of MapReduce

Note:

1. Google first implemented the Google File System, a distributed file system that provides stable storage on top of its clusters, and then implemented the MapReduce framework on top of the Google File System.

2. Hadoop is an open-source project that reimplements Google's MapReduce. [Download]

3. Many use cases of Hadoop involve SQL-like manipulations of data, so there are open-source projects called Hive and Pig that provide SQL-like abstractions on top of the Hadoop/MapReduce layer, so that you do not have to rewrite those queries as Map and Reduce functions.

MapReduce in the cloud

Pointers and Further Reading

Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters

http://labs.google.com/papers/mapreduce.html

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: The Google File System

http://labs.google.com/papers/gfs.html

Hadoop Wiki Introduction  http://wiki.apache.org/lucene-hadoop/

Getting Started  http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop

Map/Reduce Overview  http://wiki.apache.org/lucene-hadoop/HadoopMapReduce  http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses

Eclipse Environment  http://wiki.apache.org/lucene-hadoop/EclipseEnvironment

Javadoc  http://lucene.apache.org/hadoop/docs/api/

Releases from Apache download mirrors  http://www.apache.org/dyn/closer.cgi/lucene/hadoop/

Nightly builds of source  http://people.apache.org/dist/lucene/hadoop/nightly/

Source code from subversion  http://lucene.apache.org/hadoop/version_control.html

皮皮Blog

Feedback on exercises: Week 1 (Basic)

from:http://blog.csdn.net/pipisorry/article/details/48443533

ref: Mining Massive Datasets

Implementing MapReduce in Python

