Learning MR Hadoop streaming jobs: the combiner

The code has been copied to my work computer at:

/Users/baidu/Documents/Data/Work/Code/Self/hadoop_mr_streaming_jobs

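The entry point is the main control script, main.sh.

It calls extract.py.

Looking at it again, I found it is not written very well. It also uses a combiner; for background, see: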

https://blog.csdn.net/u010700335/article/details/72649186

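Streaming scripts are built on pipes: each script reads records line by line from standard input and writes its results to standard output.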

(5) Python script

import sys

for line in sys.stdin:

.......
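To make the pipe model concrete, here is a minimal sketch that simulates the map, sort, and reduce stages locally, inside one Python process. The function names (run_mapper, run_reducer, simulate) are hypothetical and chosen for illustration only. The sorted() call stands in for the shuffle/sort that Hadoop performs between the two stages, and in-memory text streams stand in for the real pipes.

```python
import io
from itertools import groupby


def run_mapper(stdin, stdout):
    """Emit one tab-delimited (word, 1) record per word."""
    for line in stdin:
        for word in line.split():
            stdout.write('%s\t%d\n' % (word, 1))


def run_reducer(stdin, stdout):
    """Sum the counts of consecutive records that share a key.

    Assumes the input is already sorted by key, as it is after the sort step.
    """
    pairs = (line.rstrip('\n').split('\t', 1) for line in stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        stdout.write('%s\t%d\n' % (word, total))


def simulate(text):
    """Run mapper -> sort -> reducer on a string, using in-memory streams."""
    mapped = io.StringIO()
    run_mapper(io.StringIO(text), mapped)
    # Stand-in for the shuffle/sort between the map and reduce stages.
    shuffled = ''.join(sorted(mapped.getvalue().splitlines(True)))
    reduced = io.StringIO()
    run_reducer(io.StringIO(shuffled), reduced)
    return reduced.getvalue()
```

Calling simulate("a b a\nb a\n") feeds two input lines through both stages and returns the tab-delimited totals as one string, which is a cheap way to check a mapper/reducer pair before submitting a real job.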

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words; split() with no arguments
    # already drops empty strings
    words = line.split()
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

#---------------------------------------------------------------------------------------------------------

#!/usr/bin/env python

from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    try:
        # parse the tab-delimited input we got from mapper.py
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # the line was malformed or count was not a number,
        # so silently ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))
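
For the combiner itself, a minimal sketch might look like the following. It is not the combiner from extract.py or from the linked article; the names combine and main are hypothetical. It pre-aggregates the mapper's tab-delimited records on the map side and emits records in the same (word, count) format, so the reducer above can consume them unchanged. It only makes sense because summing counts is associative and commutative, and the framework may apply a combiner zero, one, or several times.

```python
import sys


def combine(lines):
    """Sum counts per word and return tab-delimited records, sorted by word.

    The output uses the same format as the mapper output, so the result
    stays valid however many times the combiner is applied.
    """
    totals = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            word, count = line.split('\t', 1)
            value = int(count)
        except ValueError:
            # malformed record: skip it, as the reducer does
            continue
        totals[word] = totals.get(word, 0) + value
    return ['%s\t%d' % (word, totals[word]) for word in sorted(totals)]


def main():
    # Entry point when used as a streaming script (reads STDIN, writes STDOUT).
    for record in combine(sys.stdin):
        print(record)
```

Saved as a script (say a hypothetical combiner.py that calls main() at the bottom), this could be passed to Hadoop Streaming through its -combiner option alongside -mapper and -reducer. Because the reducer above performs the same summation, it could also serve as the combiner directly.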
