Custom Map/Reduce in Hive: A Python Example

Hive supports custom map and reduce scripts. I will illustrate this with a simple word count example, written in Python (a Java version is covered in a separate article).

Development environment:

Python: 2.7.5

Hive: 2.3.0

Hadoop: 2.8.1

1. The map and reduce scripts

Map script (mapper.py)

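#!/usr/bin/python
import re
import sys

# Split on runs of non-word characters; compiled once, outside the loop
pattern = re.compile(r'\W+')

while True:
    line = sys.stdin.readline()
    if not line:
        # readline() returns '' only at EOF; a blank input line is '\n'
        break
    # Write one "word<TAB>1" tuple to stdout per word
    for word in pattern.split(line.strip()):
        if word:  # skip empty strings produced at the line edges
            print('%s\t%s' % (word, 1))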

Reduce script (reducer.py)

#!/usr/bin/python
import sys

# Maps words to their counts
word2count = {}

while True:
    line = sys.stdin.readline()
    if not line:
        # readline() returns '' only at EOF; a blank input line is '\n'
        break
    # Parse the input we got from mapper.py
    try:
        word, count = line.strip().split('\t', 1)
    except ValueError:
        # No tab in the line: skip it
        continue
    # Convert count (currently a string) to int, keeping only its digits
    try:
        count = int(''.join(ch for ch in count if ch.isdigit()))
    except ValueError:
        continue
    word2count[word] = word2count.get(word, 0) + count

# Write the tuples to stdout
# Note: they are unsorted
for word in word2count:
    print('%s\t%s' % (word, word2count[word]))

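Two things to note. First, the scripts read input with sys.stdin.readline() rather than with for line in sys.stdin. The for loop does not read byte by byte; it also yields one line at a time, but in Python 2 the file iterator uses an internal read-ahead buffer, so lines may reach the script later than expected when reading from a pipe. Calling readline() avoids that buffering. Second, when splitting the mapper output into word and count, strip any non-digit characters from the count field so that the conversion to int does not fail.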
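The logic of the two scripts can be checked without a Hive cluster. The following is a minimal sketch, not part of the original scripts, that mirrors the mapper and reducer as plain functions, assuming the mapper emits a count of 1 per word. The names map_line, parse_count, reduce_records, and local_wordcount are hypothetical, and the in-memory sort is only a stand-in for the shuffle that Hive performs between the two scripts.

```python
import re

WORD_SPLIT = re.compile(r'\W+')


def map_line(line):
    """Mimic mapper.py: one 'word<TAB>1' record per word in the line."""
    return ['%s\t%s' % (w, 1) for w in WORD_SPLIT.split(line.strip()) if w]


def parse_count(raw):
    """Keep only the digits of the count field before converting to int."""
    digits = ''.join(ch for ch in raw if ch.isdigit())
    return int(digits)  # raises ValueError when no digits remain


def reduce_records(records):
    """Mimic reducer.py: sum the counts per word, skipping bad records."""
    totals = {}
    for record in records:
        try:
            word, raw = record.split('\t', 1)
            count = parse_count(raw)
        except ValueError:
            continue  # no tab in the record, or no digits in the count
        totals[word] = totals.get(word, 0) + count
    return totals


def local_wordcount(lines):
    """Run map, a sort (stand-in for the shuffle), then reduce."""
    records = []
    for line in lines:
        records.extend(map_line(line))
    records.sort()
    return reduce_records(records)


if __name__ == '__main__':
    print(local_wordcount(['hello world', 'bye world']))
```

Records with a stray carriage return in the count field (for example '1\r') still parse, while records with no tab or no digits are silently skipped, which matches the behaviour described above.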

2. Writing the Hive HQL

drop table if exists raw_lines;

-- Create table raw_lines and read all the lines under '/user/inputs' (a path on your HDFS)
create external table if not exists raw_lines(line string)
ROW FORMAT DELIMITED
stored as textfile
location '/user/inputs';

drop table if exists word_count;

-- Create the output table word_count, stored as text files under '/user/outputs' (a path on your HDFS)
create external table if not exists word_count(word string, count int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/outputs/';

-- Add the mapper and reducer scripts as resources; change the paths to your own local paths
add file /home/yanggy/mapper.py;
add file /home/yanggy/reducer.py;

from (
        from raw_lines
        map raw_lines.line
        -- call the mapper here
        using 'mapper.py'
        as word, count
        cluster by word) map_output
insert overwrite table word_count
reduce map_output.word, map_output.count
-- call the reducer here
using 'reducer.py'
as word, count;
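The cluster by word clause is what makes the in-memory dictionary in reducer.py safe. It is shorthand for distributing and sorting by the same column, so all rows with the same word go to the same reducer and arrive there in sorted order. A rough sketch of that behaviour follows; cluster_by is a hypothetical helper, and zlib.crc32 is only a stand-in for Hive's own hash function, which is not reproduced here.

```python
import zlib
from collections import defaultdict


def cluster_by(rows, num_reducers):
    """Assign each (word, count) row to a reducer by hashing the word,
    then sort the rows inside each reducer's partition."""
    partitions = defaultdict(list)
    for word, count in rows:
        # Stand-in hash: the same word always maps to the same reducer index
        index = zlib.crc32(word.encode('utf-8')) % num_reducers
        partitions[index].append((word, count))
    return {index: sorted(part) for index, part in partitions.items()}


if __name__ == '__main__':
    rows = [('world', 1), ('hello', 1), ('world', 1), ('bye', 1)]
    for index, part in sorted(cluster_by(rows, 2).items()):
        print(index, part)
```

Because no word is ever split across reducers, each reducer.py process can total its own words independently and the per-reducer outputs never overlap.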
