Understanding mapper-reducer via groupby
Note: by the time reduce runs, the shuffle has already happened, so the reducer sees the map output sorted by key.
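To make that note concrete, here is a minimal in-process sketch (an illustration only, not Hadoop itself) of the map -> shuffle -> reduce flow: sorted() stands in for the shuffle, and itertools.groupby plays the role of key grouping in the reduce phase. The sample lines are made up for the example.

from itertools import groupby
from operator import itemgetter

lines = ["foo foo quux", "labs foo bar quux"]

# map: emit a (word, 1) pair for every word
pairs = [(word, 1) for line in lines for word in line.split()]

# shuffle: sort by key so that equal words become adjacent
pairs.sort(key=itemgetter(0))

# reduce: collapse each run of equal keys into a single total
for word, group in groupby(pairs, itemgetter(0)):
    print(word, sum(count for _, count in group))
# bar 1
# foo 3
# labs 1
# quux 2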

mapper.py
#!/usr/bin/env python
"""mapper.py"""
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
reducer.py
#!/usr/bin/env python
"""reducer.py"""
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
# (the extra None check guards against completely empty input)
if current_word == word and current_word is not None:
    print('%s\t%s' % (current_word, current_count))
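The two scripts can be tested locally before submitting anything to a cluster. Below is a hypothetical test harness (it assumes mapper.py and reducer.py above are saved in the current directory, and Python 3.7+ for subprocess.run with capture_output); sorting the mapper output stands in for Hadoop's shuffle phase.

import subprocess

text = b"foo foo quux labs foo bar quux\n"

# map: run mapper.py as a child process, feeding it the sample text
mapped = subprocess.run(["python", "mapper.py"],
                        input=text, capture_output=True).stdout

# shuffle: sorting the (word, 1) lines stands in for Hadoop's shuffle
shuffled = b"".join(sorted(mapped.splitlines(keepends=True)))

# reduce: run reducer.py on the now-sorted pairs
reduced = subprocess.run(["python", "reducer.py"],
                         input=shuffled, capture_output=True).stdout
print(reduced.decode(), end="")
# bar     1
# foo     3
# labs    1
# quux    2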
Improved Mapper and Reducer code: using Python iterators and generators
mapper.py
#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""
import sys

def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print('%s%s%d' % (word, separator, 1))

if __name__ == "__main__":
    main()
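A quick standalone check (an assumed example, not part of the original scripts) of what "using iterators and generators" buys here: read_input yields one word list at a time, so nothing is read from the stream until the consumer asks for the next item.

import io

def read_input(stream):
    # mirror of the mapper's generator: one word list per line, lazily
    for line in stream:
        yield line.split()

gen = read_input(io.StringIO("hello world\nfoo bar\n"))
print(next(gen))   # ['hello', 'world'] -- read on demand
print(next(gen))   # ['foo', 'bar']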
reducer.py
#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""
from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a word (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print("%s%s%d" % (current_word, separator, total_count))
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()
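The "consecutive keys" comment above is the crux of the whole section: itertools.groupby only merges adjacent equal keys, which is exactly why the shuffle (sort) must run before the reducer sees its input. A small made-up experiment:

from itertools import groupby
from operator import itemgetter

pairs = [("foo", 1), ("bar", 1), ("foo", 1)]

# unsorted input: the two "foo" runs are NOT merged
print([(k, len(list(g))) for k, g in groupby(pairs, itemgetter(0))])
# [('foo', 1), ('bar', 1), ('foo', 1)]

# sorted input: equal keys become adjacent, so they merge correctly
pairs.sort(key=itemgetter(0))
print([(k, sum(c for _, c in g)) for k, g in groupby(pairs, itemgetter(0))])
# [('bar', 1), ('foo', 2)]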
1. 使用命令parted -l 查看当前分区 可以看到硬盘有2396GB即有2.5T , 但是分区就分了50G一个盘, 需要分剩下部分 [root@localhost ~]# parted -l M ...