Running Hadoop 2.0 MapReduce programs with Python

Key point: start each script with #!/usr/bin/python. The .py files are shipped to every node, so they must be executable.
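
The check described above can be automated before submitting a job. The following is a minimal sketch, not part of the original post: `is_streaming_ready` and `make_executable` are hypothetical helper names, and the check only looks at the shebang and the owner-execute bit.

```python
import os
import stat


def is_streaming_ready(path):
    """Return True if the file starts with a shebang and its owner-execute bit is set."""
    with open(path, "rb") as f:
        has_shebang = f.read(2) == b"#!"
    executable = bool(os.stat(path).st_mode & stat.S_IXUSR)
    return has_shebang and executable


def make_executable(path):
    """Rough equivalent of `chmod +x`: add execute bits for user, group and others."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

For example, calling `make_executable('ipmapper.py')` once before submitting the job has the same effect as running `chmod +x ipmapper.py` by hand.
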
1) Count the distinct IPs (across all logs), i.e. the total number of different IPs

####################Local test############################

cat /home/hadoop/Sep-/*/* | python ipmapper.py | sort | python ipreducer.py

Partial local test results:

99.67.46.254    13

99.95.174.29    47

sum of single ip    13349

#####################Running on the Hadoop cluster############################

bin/hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar -mapper /data/hadoop/jobs_python/job_logstat/ipmapper.py -reducer /data/hadoop/jobs_python/job_logstat/ipreducer.py -input /log_original/* -output /log_ipnum -file /data/hadoop/jobs_python/job_logstat/ipmapper.py -file /data/hadoop/jobs_python/job_logstat/ipreducer.py

Partial cluster test results:

99.67.46.254    13

99.95.174.29    47

sum of single ip    13349

ipmapper.py:

##########################mapper code#######################################

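#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import sys

# Capture the client IP and the requested path from each access-log line.
# The dots are escaped so that they match only a literal ".".
pat = re.compile(r'(?P<ip>\d+\.\d+\.\d+\.\d+).*?"\w+ (?P<subdir>.*?) ')

for line in sys.stdin:
    match = pat.search(line)
    if match:
        # Emit "ip<TAB>1"; the reducer sums the ones per IP.
        print('%s\t%s' % (match.group('ip'), 1))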
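
To see what the mapper's pattern extracts, it can be tried on a single line. This is a sketch under assumptions: the sample line is hypothetical (an Apache-style access log whose request field holds a full URL, which is what the result listings suggest), and the second half only illustrates why an unescaped `.` is too permissive.

```python
import re

# Mapper pattern with the dots escaped so they match only a literal "."
pat = re.compile(r'(?P<ip>\d+\.\d+\.\d+\.\d+).*?"\w+ (?P<subdir>.*?) ')

# Hypothetical log line; the real log format is an assumption.
sample = ('99.67.46.254 - - [21/Sep/2013:10:00:00 +0800] '
          '"GET http://dongxicheng.org/recommend/ HTTP/1.1" 200 1234')

match = pat.search(sample)
print('%s\t%s' % (match.group('ip'), match.group('subdir')))

# With unescaped dots, any run of seven digits looks like an "IP".
loose = re.compile(r'\d+.\d+.\d+.\d+')
strict = re.compile(r'\d+\.\d+\.\d+\.\d+')
print(bool(loose.search('1234567')), bool(strict.search('1234567')))  # True False
```

The first `print` emits the IP and the requested URL separated by a tab, which are the two fields the mappers in this post use as keys.
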

ipreducer.py:

##########################reducer code#####################################

#!/usr/bin/python
from operator import itemgetter
import sys

# Maps each IP to its total number of requests.
dict_ip_count = {}

for line in sys.stdin:
    line = line.strip()
    try:
        # Both a missing tab and a non-numeric count raise ValueError.
        ip, num = line.split('\t', 1)
        dict_ip_count[ip] = dict_ip_count.get(ip, 0) + int(num)
    except ValueError:
        # Skip malformed lines.
        pass

# Sort by IP so the output order is stable.
sorted_dict_ip_count = sorted(dict_ip_count.items(), key=itemgetter(0))

for ip, count in sorted_dict_ip_count:
    print('%s\t%s' % (ip, count))
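
The sample results include a `sum of single ip` line, which the reducer listed here does not print. One way such a line might be produced is sketched below; this is an assumption about the intent, not the author's code, and `reduce_ip_counts` is a hypothetical name. The total is only global when the job runs with a single reduce task, since each reducer sees only its own share of the keys.

```python
def reduce_ip_counts(lines):
    """Sum counts per IP from 'ip<TAB>count' lines.

    Returns (pairs sorted by IP, number of distinct IPs).
    """
    counts = {}
    for line in lines:
        try:
            ip, num = line.strip().split('\t', 1)
            counts[ip] = counts.get(ip, 0) + int(num)
        except ValueError:
            continue  # malformed line: no tab, or a non-numeric count
    return sorted(counts.items()), len(counts)


# Small in-memory demonstration; a streaming script would pass sys.stdin instead.
pairs, distinct = reduce_ip_counts(
    ['99.67.46.254\t1\n', '99.95.174.29\t1\n', '99.67.46.254\t1\n', 'garbage\n'])
for ip, count in pairs:
    print('%s\t%s' % (ip, count))
print('sum of single ip\t%s' % distinct)
```
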

2) Count the number of visits to each subdirectory (across all logs)

########################Local test######################################

cat /home/hadoop/Sep-2013/*/* | python subdirmapper.py | sort | python subdirreducer.py

Partial results:

http://dongxicheng.org/recommend/    2

http://dongxicheng.org/search-engine/scribe-intro/trackback/    1

http://dongxicheng.org/structure/permutation-combination/    1

http://dongxicheng.org/structure/sort/trackback/    1

http://dongxicheng.org/wp-comments-post.php    5

http://dongxicheng.org/wp-login.php/    3535

http://hadoop123.org/administrator/index.php    4

#######################Running on the Hadoop cluster########################################

bin/hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar -mapper /data/hadoop/jobs_python/job_logstat/subdirmapper.py -reducer /data/hadoop/jobs_python/job_logstat/subdirreducer.py -input /log_original/* -output /log_subdirnum -file /data/hadoop/jobs_python/job_logstat/subdirmapper.py -file /data/hadoop/jobs_python/job_logstat/subdirreducer.py

Partial results:

http://dongxicheng.org/search-engine/scribe-intro/trackback/    1

http://dongxicheng.org/structure/permutation-combination/    1

http://dongxicheng.org/structure/sort/trackback/    1

http://dongxicheng.org/wp-comments-post.php    5

http://dongxicheng.org/wp-login.php/    3535

http://hadoop123.org/administrator/index.php    4

#######################################mapper code###########################################

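#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import sys

# Same pattern as ipmapper.py; here the requested path is used as the key.
# The dots are escaped so that they match only a literal ".".
pat = re.compile(r'(?P<ip>\d+\.\d+\.\d+\.\d+).*?"\w+ (?P<subdir>.*?) ')

for line in sys.stdin:
    match = pat.search(line)
    if match:
        # Emit "subdir<TAB>1"; the reducer sums the ones per subdirectory.
        print('%s\t%s' % (match.group('subdir'), 1))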

#######################################reducer code###########################################

#!/usr/bin/python
from operator import itemgetter
import sys

# Maps each subdirectory to its total number of visits.
dict_subdir_count = {}

for line in sys.stdin:
    line = line.strip()
    try:
        # Both a missing tab and a non-numeric count raise ValueError.
        subdir, num = line.split('\t', 1)
        dict_subdir_count[subdir] = dict_subdir_count.get(subdir, 0) + int(num)
    except ValueError:
        # Skip malformed lines.
        pass

# Sort by subdirectory so the output order is stable.
sorted_dict_subdir_count = sorted(dict_subdir_count.items(), key=itemgetter(0))

for subdir, count in sorted_dict_subdir_count:
    print('%s\t%s' % (subdir, count))
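
Once the per-subdirectory counts are available, the most-visited paths can be found by sorting on the count rather than the key. This is a minimal sketch, assuming the reducer output has been saved as tab-separated lines; `top_n` is a hypothetical helper, and the sample lines are taken from the partial results listed earlier.

```python
def top_n(lines, n):
    """Return the n (key, count) pairs with the highest counts, ties broken by key."""
    pairs = []
    for line in lines:
        try:
            key, count = line.rstrip('\n').rsplit('\t', 1)
            pairs.append((key, int(count)))
        except ValueError:
            continue  # skip lines without a tab or with a non-numeric count
    return sorted(pairs, key=lambda kv: (-kv[1], kv[0]))[:n]


sample = [
    'http://dongxicheng.org/recommend/\t2',
    'http://dongxicheng.org/wp-comments-post.php\t5',
    'http://dongxicheng.org/wp-login.php/\t3535',
    'http://hadoop123.org/administrator/index.php\t4',
]
for key, count in top_n(sample, 2):
    print('%s\t%s' % (key, count))
```
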


[Still, it may be better to write MR programs in Java]
Reference URL:
http://asfr.blogbus.com/logs/44208067.html

bin/hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar -mapper /data/hadoop/mapper.py -reducer /data/hadoop/reducer.py -input /in/* -output /py_out -file /data/hadoop/mapper.py -file /data/hadoop/reducer.py 

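How developing MapReduce in Python works:
》It follows the same model as the Linux pipe mechanism.
》Processes communicate through standard input and standard output.
》Standard input and output are supported by every language.
A few examples:
cat 1.txt | grep 'dong' | sort
cat 1.txt | python grep.py | java -jar sort.jar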

Reading input from the standard input stream:
C++: cin
C: scanf
Writing output to the standard output stream:
C++: cout
C: printf
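
The Python counterparts are `sys.stdin` and `sys.stdout`. As an illustration, the `grep.py` used in the pipeline example might look like the sketch below; the original post does not show that file, so the contents and the function names `grep_lines` and `main` are assumptions.

```python
import sys


def grep_lines(lines, needle):
    """Yield only the lines that contain `needle`, like `grep 'needle'`."""
    for line in lines:
        if needle in line:
            yield line


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Filter standard input to standard output, keeping lines that contain 'dong'."""
    for line in grep_lines(stdin, 'dong'):
        stdout.write(line)
```

Calling `main()` at the bottom of the script would make `cat 1.txt | python grep.py | sort` behave like the `grep 'dong'` version of the pipeline.
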

Limitation: only the Mapper and Reducer can be implemented this way; other components must be implemented in Java.

Testing with hadoop-streaming is very simple.
Compile the programs to produce executables:
g++ -o mapper mapper.cpp
g++ -o reducer reducer.cpp
Test the programs:
cat test.txt | ./mapper | sort | ./reducer
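
The same `mapper | sort | reducer` pipeline can be imitated inside a single Python process, which is handy for quick unit tests. This is a rough sketch under assumptions: `run_local`, `map_words` and `reduce_counts` are hypothetical names, and Python's `sorted` stands in for the shell `sort` (the two may order non-ASCII keys differently).

```python
def map_words(lines):
    """Mapper: emit 'word<TAB>1' for every word."""
    for line in lines:
        for word in line.split():
            yield '%s\t%s' % (word, 1)


def reduce_counts(lines):
    """Reducer: sum the counts per word and emit them in key order."""
    counts = {}
    for line in lines:
        word, count = line.split('\t', 1)
        counts[word] = counts.get(word, 0) + int(count)
    for word in sorted(counts):
        yield '%s\t%s' % (word, counts[word])


def run_local(lines, mapper, reducer):
    """Mimic `cat input | mapper | sort | reducer` in one process."""
    return list(reducer(sorted(mapper(lines))))


result = run_local(['hello world', 'hello hadoop'], map_words, reduce_counts)
print('\n'.join(result))
```
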


mapper.py:

#!/usr/bin/python
# coding=utf-8
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

reducer.py:

#!/usr/bin/python
# coding=utf-8
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    try:
        # parse the input we got from mapper.py and
        # convert count (currently a string) to int
        word, count = line.split('\t', 1)
        word2count[word] = word2count.get(word, 0) + int(count)
    except ValueError:
        # the line had no tab or count was not a number,
        # so silently ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))
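
The reducer above keeps every word in a dictionary until the input ends. Because Hadoop sorts the mapper output by key before the reducer reads it (as `sort` does in the local pipeline), an alternative is to sum each run of identical keys as it streams past. The following is a sketch of that variant using `itertools.groupby`, not the code from the post; `parse` and `reduce_sorted` are hypothetical names.

```python
from itertools import groupby


def parse(lines):
    """Turn 'word<TAB>count' lines into (word, int) pairs, skipping malformed lines."""
    for line in lines:
        try:
            word, count = line.strip().split('\t', 1)
            pair = (word, int(count))
        except ValueError:
            continue
        yield pair


def reduce_sorted(lines):
    """Sum the counts of each run of identical keys; the input must be sorted by key."""
    for word, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


# In-memory demonstration; a streaming script would iterate over sys.stdin instead.
sorted_input = ['hadoop\t1\n', 'hello\t1\n', 'hello\t1\n', 'oops\n', 'world\t1\n']
for word, total in reduce_sorted(sorted_input):
    print('%s\t%s' % (word, total))
```

Only one key's running total is held in memory at a time, which matters when the number of distinct keys is large.
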

packageJobJar: [/data/hadoop/mapper.py, /data/hadoop/reducer.py, /data/hadoop/hadoop_tmp/hadoop-unjar4601454529868960285/] [] /tmp/streamjob2970217681900457939.jar tmpDir=null
14/03/21 16:23:09 INFO client.RMProxy: Connecting to ResourceManager at /192.168.2.200:8032
14/03/21 16:23:09 INFO client.RMProxy: Connecting to ResourceManager at /192.168.2.200:8032
14/03/21 16:23:10 INFO mapred.FileInputFormat: Total input paths to process : 2
14/03/21 16:23:10 INFO mapreduce.JobSubmitter: number of splits:2
14/03/21 16:23:10 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/03/21 16:23:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1394086709210_0008
14/03/21 16:23:10 INFO impl.YarnClientImpl: Submitted application application_1394086709210_0008 to ResourceManager at /192.168.2.200:8032
14/03/21 16:23:10 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1394086709210_0008/
14/03/21 16:23:10 INFO mapreduce.Job: Running job: job_1394086709210_0008
14/03/21 16:23:14 INFO mapreduce.Job: Job job_1394086709210_0008 running in uber mode : false
14/03/21 16:23:14 INFO mapreduce.Job:  map 0% reduce 0%
14/03/21 16:23:19 INFO mapreduce.Job:  map 100% reduce 0%
14/03/21 16:23:23 INFO mapreduce.Job:  map 100% reduce 100%
14/03/21 16:23:24 INFO mapreduce.Job: Job job_1394086709210_0008 completed successfully
14/03/21 16:23:24 INFO mapreduce.Job: Counters: 43
    File System Counters
        FILE: Number of bytes read=47
        FILE: Number of bytes written=248092
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=197
        HDFS: Number of bytes written=25
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters 
        Launched map tasks=2
        Launched reduce tasks=1
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=5259
        Total time spent by all reduces in occupied slots (ms)=2298
    Map-Reduce Framework
        Map input records=2
        Map output records=4
        Map output bytes=33
        Map output materialized bytes=53
        Input split bytes=172
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=53
        Reduce input records=4
        Reduce output records=3
        Spilled Records=8
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=71
        CPU time spent (ms)=1300
        Physical memory (bytes) snapshot=678060032
        Virtual memory (bytes) snapshot=2662100992
        Total committed heap usage (bytes)=514326528
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=25
    File Output Format Counters 
        Bytes Written=25
14/03/21 16:23:24 INFO streaming.StreamJob: Output directory: /py_out

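1. When developing in Java on Hadoop, you can get the name of the input file with:

FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
String fileName = fileSplit.getPath().getName();

2. Likewise, when developing in Python, you can get the input file with:

import os
os.environ["map_input_file"]

Here map_input_file corresponds to the map.input.file property: streaming exports job properties as environment variables with the dots replaced by underscores.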
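
A mapper that needs the file name can wrap the lookup in a small helper. This is a sketch under assumptions: `current_input_file` is a hypothetical name, and it also checks `mapreduce_map_input_file` on the assumption that Hadoop 2 exports the property under its newer name `mapreduce.map.input.file`. Outside a streaming job neither variable is set, so the helper returns None instead of raising KeyError.

```python
import os


def current_input_file(environ=os.environ):
    """Return the input file of the current split, or None when not run by streaming."""
    # Newer property name first (assumed for Hadoop 2), then the older one.
    for name in ('mapreduce_map_input_file', 'map_input_file'):
        value = environ.get(name)
        if value:
            return value
    return None


def current_input_name(environ=os.environ):
    """Return only the last path component, like getPath().getName() in Java."""
    path = current_input_file(environ)
    return os.path.basename(path) if path else None
```
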
