Running Hadoop 2.0 MapReduce programs with Python
Key point: every script starts with #!/usr/bin/python and must be executable, because the .py files are shipped to each node.
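To make the point about the shebang line and the executable bit concrete, here is a small sketch, written for Python 3, that checks both conditions before a script is submitted. The helper names is_streaming_ready and make_executable are made up for this example; in practice a plain chmod +x on the script does the same job.

```python
import os
import stat


def is_streaming_ready(path):
    """Return True if the file starts with '#!' and its owner may execute it."""
    with open(path, 'rb') as handle:
        has_shebang = handle.read(2) == b'#!'
    is_executable = bool(os.stat(path).st_mode & stat.S_IXUSR)
    return has_shebang and is_executable


def make_executable(path):
    """Add the execute bits for user, group and others, keeping all other bits."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

Calling make_executable('ipmapper.py') followed by is_streaming_ready('ipmapper.py') should then report True for a script that begins with the shebang line.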
1) Count the number of distinct IPs (across all logs), i.e. the total number of different IPs
#################### Local test ####################
cat /home/hadoop/Sep-/*/* | python ipmapper.py | sort | python ipreducer.py
Partial local test results:
99.67.46.254 13
99.95.174.29 47
sum of single ip 13349
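The last line of this output reports the number of distinct IPs, but the ipreducer.py listing in this post contains no statement that prints it. A minimal sketch of how such a line could be produced, assuming the reducer simply prints the size of its dictionary once the input is exhausted. The function names count_ips and write_report are made up for this example, and the wording of the summary line is copied from the output above, not from the author's code.

```python
import sys


def count_ips(lines):
    """Aggregate 'ip<TAB>count' lines into a dict mapping ip -> total hits."""
    totals = {}
    for line in lines:
        parts = line.strip().split('\t')
        if len(parts) != 2:
            continue  # skip lines that are not 'key<TAB>value'
        ip, num = parts
        try:
            totals[ip] = totals.get(ip, 0) + int(num)
        except ValueError:
            continue  # the count was not a number
    return totals


def write_report(totals, out=sys.stdout):
    """Print one line per IP, then the number of distinct IPs."""
    for ip in sorted(totals):
        out.write('%s\t%s\n' % (ip, totals[ip]))
    out.write('sum of single ip\t%s\n' % len(totals))

# In a reducer script the last line would be:
#     write_report(count_ips(sys.stdin))
```

Note that this total is only a global figure when the job runs with a single reduce task; with several reducers each one would print the count for its own share of the keys.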
#################### Running on the Hadoop cluster ####################
bin/hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar -mapper /data/hadoop/jobs_python/job_logstat/ipmapper.py -reducer /data/hadoop/jobs_python/job_logstat/ipreducer.py -input /log_original/* -output /log_ipnum -file /data/hadoop/jobs_python/job_logstat/ipmapper.py -file /data/hadoop/jobs_python/job_logstat/ipreducer.py
Partial cluster test results:
99.67.46.254 13
99.95.174.29 47
sum of single ip 13349
ipmapper.py:
########################## mapper code ##########################
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import sys

# Raw string with the dots escaped, so that "." only matches a literal dot.
pat = re.compile(r'(?P<ip>\d+\.\d+\.\d+\.\d+).*?"\w+ (?P<subdir>.*?) ')
for line in sys.stdin:
    match = pat.search(line)
    if match:
        # Emit "ip<TAB>1" for every log line that contains an IP.
        print('%s\t%s' % (match.group('ip'), 1))
ipreducer.py:
########################## reducer code ##########################
#!/usr/bin/python
from operator import itemgetter
import sys

dict_ip_count = {}
for line in sys.stdin:
    line = line.strip()
    try:
        # Both a missing tab and a non-numeric count raise ValueError.
        ip, num = line.split('\t')
        num = int(num)
        dict_ip_count[ip] = dict_ip_count.get(ip, 0) + num
    except ValueError:
        pass

# Sort by IP so that the output order is stable.
sorted_dict_ip_count = sorted(dict_ip_count.items(), key=itemgetter(0))
for ip, count in sorted_dict_ip_count:
    print('%s\t%s' % (ip, count))
2) Count the number of visits to each subdirectory (across all logs)
######################## Local test ########################
cat /home/hadoop/Sep-2013/*/* | python subdirmapper.py | sort | python subdirreducer.py
Partial results:
http://dongxicheng.org/recommend/ 2
http://dongxicheng.org/search-engine/scribe-intro/trackback/ 1
http://dongxicheng.org/structure/permutation-combination/ 1
http://dongxicheng.org/structure/sort/trackback/ 1
http://dongxicheng.org/wp-comments-post.php 5
http://dongxicheng.org/wp-login.php/ 3535
http://hadoop123.org/administrator/index.php 4
####################### Running on the Hadoop cluster #######################
bin/hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.2..jar -mapper /data/hadoop/jobs_python/job_logstat/subdirmapper.py -reducer /data/hadoop/jobs_python/job_logstat/subdirreducer.py -input /log_original/* -output /log_subdirnum -file /data/hadoop/jobs_python/job_logstat/subdirmapper.py -file /data/hadoop/jobs_python/job_logstat/subdirreducer.py
Partial results:
http://dongxicheng.org/search-engine/scribe-intro/trackback/ 1
http://dongxicheng.org/structure/permutation-combination/ 1
http://dongxicheng.org/structure/sort/trackback/ 1
http://dongxicheng.org/wp-comments-post.php 5
http://dongxicheng.org/wp-login.php/ 3535
http://hadoop123.org/administrator/index.php 4
####################################### mapper code #######################################
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import sys

# Raw string with the dots escaped, so that "." only matches a literal dot.
pat = re.compile(r'(?P<ip>\d+\.\d+\.\d+\.\d+).*?"\w+ (?P<subdir>.*?) ')
for line in sys.stdin:
    match = pat.search(line)
    if match:
        # Emit "subdir<TAB>1" for every request found in the log.
        print('%s\t%s' % (match.group('subdir'), 1))
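To see what the regular expression in the two mappers actually captures, here is a small sketch, written for Python 3, that applies it to a single line. The log line in SAMPLE is invented for this example (the real log format is not shown in this post), and parse_line is a hypothetical helper. The dots are escaped here on purpose: with a bare "." the IP part would also accept strings such as 1a2b3c4.

```python
import re

# The mappers' pattern, as a raw string with the dots escaped.
LOG_PATTERN = re.compile(r'(?P<ip>\d+\.\d+\.\d+\.\d+).*?"\w+ (?P<subdir>.*?) ')

# Pattern with unescaped dots, kept only to show the difference.
LOOSE_IP = re.compile(r'\d+.\d+.\d+.\d+')
STRICT_IP = re.compile(r'\d+\.\d+\.\d+\.\d+')

# Hypothetical access-log line, made up for illustration.
SAMPLE = ('10.0.0.1 - - [01/Sep/2013:00:00:01 +0800] '
          '"GET http://example.org/index.php HTTP/1.1" 200 512')


def parse_line(line):
    """Return (ip, subdir) for a matching log line, or None if nothing matches."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return match.group('ip'), match.group('subdir')
```

The ip group is the dotted address at the start of the line, and the subdir group is the text between the request method and the next space.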
####################################### reducer code #######################################
#!/usr/bin/python
from operator import itemgetter
import sys

dict_subdir_count = {}
for line in sys.stdin:
    line = line.strip()
    try:
        # Both a missing tab and a non-numeric count raise ValueError.
        subdir, num = line.split('\t')
        num = int(num)
        dict_subdir_count[subdir] = dict_subdir_count.get(subdir, 0) + num
    except ValueError:
        pass

# Sort by subdirectory so that the output order is stable.
sorted_dict_subdir_count = sorted(dict_subdir_count.items(), key=itemgetter(0))
for subdir, count in sorted_dict_subdir_count:
    print('%s\t%s' % (subdir, count))
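Both reducers above keep every key in a dictionary until the input ends. Hadoop Streaming hands the reducer its input sorted by key (the local test does the same with sort), so the totals can also be computed one key at a time. A minimal sketch of that alternative, written for Python 3; the names parse_pairs and reduce_sorted are made up for this example and are not part of the original scripts.

```python
from itertools import groupby


def parse_pairs(lines):
    """Yield (key, int_value) for each well-formed 'key<TAB>value' line."""
    for line in lines:
        key, sep, value = line.rstrip('\n').partition('\t')
        if not sep:
            continue  # no tab on this line
        try:
            yield key, int(value)
        except ValueError:
            continue  # the value was not a number


def reduce_sorted(lines):
    """Yield (key, total) pairs; the input must already be sorted by key."""
    pairs = parse_pairs(lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(count for _, count in group)

# In a reducer script:
#     import sys
#     for key, total in reduce_sorted(sys.stdin):
#         print('%s\t%s' % (key, total))
```

Only the running total of the current key is held in memory, which matters once the number of distinct keys gets large. The trade-off is that the function silently gives wrong totals if the input is not sorted.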
[Better to write MR programs in Java after all]
Reference URL:
http://asfr.blogbus.com/logs/44208067.html
bin/hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar -mapper /data/hadoop/mapper.py -reducer /data/hadoop/reducer.py -input /in/* -output /py_out -file /data/hadoop/mapper.py -file /data/hadoop/reducer.py
How MapReduce development in Python works:
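》It works the same way as the Linux pipe mechanism
》Processes communicate through standard input and standard output
》Standard input and output are supported by every language.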
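The three points above can be tried out without a cluster. The following sketch, written for Python 3, starts a mapper and a reducer as separate processes and connects them only through standard input and standard output, with an in-memory sort standing in for the shuffle. MAPPER_SRC, REDUCER_SRC, pipe and run_pipeline are names invented for this example; the two embedded scripts are simplified word-count stand-ins, not the author's files.

```python
import subprocess
import sys

# A tiny word-count mapper, run as its own process.
MAPPER_SRC = r'''
import sys
for line in sys.stdin:
    for word in line.split():
        print('%s\t%s' % (word, 1))
'''

# A tiny word-count reducer, run as its own process.
REDUCER_SRC = r'''
import sys
counts = {}
for line in sys.stdin:
    word, num = line.rstrip('\n').split('\t', 1)
    counts[word] = counts.get(word, 0) + int(num)
for word in sorted(counts):
    print('%s\t%s' % (word, counts[word]))
'''


def pipe(source, text):
    """Run `source` in a separate Python process, feeding `text` to its stdin."""
    result = subprocess.run([sys.executable, '-c', source], input=text,
                            stdout=subprocess.PIPE, universal_newlines=True,
                            check=True)
    return result.stdout


def run_pipeline(text):
    """Mimic `cat input | mapper | sort | reducer`."""
    mapped = pipe(MAPPER_SRC, text)
    shuffled = ''.join(sorted(mapped.splitlines(True)))  # the `sort` stage
    return pipe(REDUCER_SRC, shuffled)
```

Neither child process knows anything about Hadoop; each one only reads lines and writes lines, which is why the same scripts can be tested with an ordinary shell pipe.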
A few examples:
cat 1.txt | grep 'dong' | sort
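cat 1.txt | python grep.py | java -jar sort.jar
Using the standard input stream as input:
c++: cin
c: scanf
Using the standard output stream as output:
c++: cout
c: printf
Limitation: only the Mapper and the Reducer can be implemented this way; the other components have to be implemented in Java.
Testing a hadoop-streaming program is very simple.
Compile the programs to produce the executables: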
g++ -o mapper mapper.cpp
g++ -o reducer reducer.cpp
Test the programs:
cat test.txt | ./mapper | sort | ./reducer
mapper.py:
#!/usr/bin/python
# coding=utf-8
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
reducer.py:
#!/usr/bin/python
# coding=utf-8
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    try:
        # parse the input we got from mapper.py and
        # convert count (currently a string) to int
        word, count = line.split('\t', 1)
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # the line had no tab or count was not a number,
        # so silently ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))
packageJobJar: [/data/hadoop/mapper.py, /data/hadoop/reducer.py, /data/hadoop/hadoop_tmp/hadoop-unjar4601454529868960285/] [] /tmp/streamjob2970217681900457939.jar tmpDir=null
14/03/21 16:23:09 INFO client.RMProxy: Connecting to ResourceManager at /192.168.2.200:8032
14/03/21 16:23:09 INFO client.RMProxy: Connecting to ResourceManager at /192.168.2.200:8032
14/03/21 16:23:10 INFO mapred.FileInputFormat: Total input paths to process : 2
14/03/21 16:23:10 INFO mapreduce.JobSubmitter: number of splits:2
14/03/21 16:23:10 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
14/03/21 16:23:10 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/03/21 16:23:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1394086709210_0008
14/03/21 16:23:10 INFO impl.YarnClientImpl: Submitted application application_1394086709210_0008 to ResourceManager at /192.168.2.200:8032
14/03/21 16:23:10 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1394086709210_0008/
14/03/21 16:23:10 INFO mapreduce.Job: Running job: job_1394086709210_0008
14/03/21 16:23:14 INFO mapreduce.Job: Job job_1394086709210_0008 running in uber mode : false
14/03/21 16:23:14 INFO mapreduce.Job: map 0% reduce 0%
14/03/21 16:23:19 INFO mapreduce.Job: map 100% reduce 0%
14/03/21 16:23:23 INFO mapreduce.Job: map 100% reduce 100%
14/03/21 16:23:24 INFO mapreduce.Job: Job job_1394086709210_0008 completed successfully
14/03/21 16:23:24 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=47
FILE: Number of bytes written=248092
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=197
HDFS: Number of bytes written=25
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=5259
Total time spent by all reduces in occupied slots (ms)=2298
Map-Reduce Framework
Map input records=2
Map output records=4
Map output bytes=33
Map output materialized bytes=53
Input split bytes=172
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=53
Reduce input records=4
Reduce output records=3
Spilled Records=8
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=71
CPU time spent (ms)=1300
Physical memory (bytes) snapshot=678060032
Virtual memory (bytes) snapshot=2662100992
Total committed heap usage (bytes)=514326528
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=25
File Output Format Counters
Bytes Written=25
14/03/21 16:23:24 INFO streaming.StreamJob: Output directory: /py_out