1. Requirements
2. Approach
3. Code implementation
3.1 MyWeather class code:
This class is the job driver. It holds the Hadoop configuration and registers the classes the job needs to load when it runs.
package com.hadoop.mr.weather;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyWeather {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration(true);
        Job job = Job.getInstance(conf);
        job.setJarByClass(MyWeather.class);

        //---------- conf ----------
        //--- begin Map:
        // Input format class (left commented out, so the default TextInputFormat is used)
        // job.setInputFormatClass(ooxx.class);
        // Mapper class and its output key/value types
        job.setMapperClass(TMapper.class);
        job.setMapOutputKeyClass(TQ.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Partitioner class
        job.setPartitionerClass(TPartitioner.class);
        // Sort comparator class
        job.setSortComparatorClass(TSortComparator.class);
        // Combiner class (not used in this job)
        // job.setCombinerClass(TCombiner.class);
        //--- end Map

        //--- begin Reduce:
        // Grouping comparator class
        job.setGroupingComparatorClass(TGroupingComparator.class);
        // Reducer class
        job.setReducerClass(TReducer.class);
        //--- end Reduce

        // Input path
        Path input = new Path("/data/tq/input");
        FileInputFormat.addInputPath(job, input);

        // Output path
        Path output = new Path("/data/tq/output");
        if (output.getFileSystem(conf).exists(output)) {
            // If the directory already exists, delete it recursively
            output.getFileSystem(conf).delete(output, true);
        }
        FileOutputFormat.setOutputPath(job, output);

        // Number of reduce tasks; this is also the number of partitions
        job.setNumReduceTasks(2);
        //--------------------------

        // Submit the job, wait for it, and report success or failure through the exit code
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
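3.2 TMapper class code
This class extends Mapper. Its main job is to preprocess the input file: each line is parsed into a TQ key (year, month, day, temperature) and an IntWritable temperature value.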
package com.hadoop.mr.weather;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.StringUtils;

// With TextInputFormat the input key is a LongWritable (the line's byte offset) and the value is a Text (the line)
public class TMapper extends Mapper<LongWritable, Text, TQ, IntWritable> {

    // Map output key and value objects, reused for every record
    TQ mkey = new TQ();                   // map output key
    IntWritable mval = new IntWritable(); // map output value

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, TQ, IntWritable>.Context context)
            throws IOException, InterruptedException {
        /*
         * Sample input (the code expects a tab between the timestamp and the temperature):
         * 1949-10-01 14:21:02 34c
         * 1949-10-01 19:21:02 38c
         * 1949-10-02 14:01:02 36c
         * 1950-01-01 11:21:02 32c
         * 1950-10-01 12:21:02 37c
         */
        try {
            // Split the line on the tab character: strs[0] is the timestamp, strs[1] the temperature
            String[] strs = StringUtils.split(value.toString(), '\t');
            // Only the date part of the timestamp is parsed; the time of day is ignored
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
            Date date = sdf.parse(strs[0]);
            Calendar cal = Calendar.getInstance();
            cal.setTime(date);
            mkey.setYear(cal.get(Calendar.YEAR));
            mkey.setMonth(cal.get(Calendar.MONTH) + 1); // Calendar.MONTH is zero-based, so add 1
            mkey.setDay(cal.get(Calendar.DAY_OF_MONTH));
            // Strip the trailing 'c' and parse the temperature as an int
            int wd = Integer.parseInt(strs[1].substring(0, strs[1].length() - 1));
            mkey.setWd(wd);
            mval.set(wd);
            context.write(mkey, mval);
        } catch (ParseException e) {
            // A line whose date cannot be parsed is logged and skipped
            e.printStackTrace();
        }
    }
}
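The parsing step in map() can be tried outside Hadoop. The following is a minimal plain-Java sketch, not part of the original job: the class name WeatherLineParser and its parse method are hypothetical, it uses java.time.LocalDate in place of the SimpleDateFormat/Calendar pair above, and it assumes every line has the form "yyyy-MM-dd HH:mm:ss<TAB>NNc".

```java
import java.time.LocalDate;

// Hypothetical helper (not in the original job): parses one input line
// the way TMapper.map() does, but without any Hadoop types.
class WeatherLineParser {

    /** Returns {year, month, day, temperature} for a line like "1949-10-01 14:21:02\t34c". */
    static int[] parse(String line) {
        String[] parts = line.split("\t");           // timestamp <TAB> temperature
        String datePart = parts[0].substring(0, 10); // keep only "yyyy-MM-dd"
        LocalDate date = LocalDate.parse(datePart);  // ISO date; months are 1-based here
        String temp = parts[1].trim();
        int wd = Integer.parseInt(temp.substring(0, temp.length() - 1)); // drop the trailing 'c'
        return new int[] { date.getYear(), date.getMonthValue(), date.getDayOfMonth(), wd };
    }

    public static void main(String[] args) {
        int[] r = parse("1949-10-01 14:21:02\t34c");
        System.out.println(r[0] + "-" + r[1] + "-" + r[2] + " " + r[3]); // prints "1949-10-1 34"
    }
}
```

Because LocalDate months are already 1-based, the "+ 1" correction needed with Calendar.MONTH does not appear here.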
3.3 TQ class code
This class implements the WritableComparable interface. It defines the fields of the key and overrides the write, read, and compare methods.
package com.hadoop.mr.weather;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class TQ implements WritableComparable<TQ> {

    // Fields
    private int year;
    private int month;
    private int day;
    private int wd; // temperature

    public int getYear() {
        return year;
    }

    public void setYear(int year) {
        this.year = year;
    }

    public int getMonth() {
        return month;
    }

    public void setMonth(int month) {
        this.month = month;
    }

    public int getDay() {
        return day;
    }

    public void setDay(int day) {
        this.day = day;
    }

    public int getWd() {
        return wd;
    }

    public void setWd(int wd) {
        this.wd = wd;
    }

    // Serialize the fields; readFields() must read them back in the same order
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(month);
        out.writeInt(day);
        out.writeInt(wd);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.year = in.readInt();
        this.month = in.readInt();
        this.day = in.readInt();
        this.wd = in.readInt();
    }

    @Override
    public int compareTo(TQ that) {
        // Return value: 0 if this == that, negative if this < that, positive if this > that.
        // Dates sort in ascending order: compare the years first.
        int c1 = Integer.compare(this.year, that.getYear());
        if (c1 == 0) {
            // Same year: compare the months
            int c2 = Integer.compare(this.month, that.getMonth());
            if (c2 == 0) {
                // Same month: compare the days (0 means the same date)
                return Integer.compare(this.day, that.getDay());
            }
            return c2;
        }
        // Different years: the year comparison decides
        return c1;
    }
}
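The write()/readFields() pair only works if both methods handle the fields in the same order. The sketch below illustrates that contract with plain java.io streams and no Hadoop dependency; the class TqRoundTrip and its two methods are hypothetical names introduced for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical stand-alone demo of the write()/readFields() contract used by TQ.
class TqRoundTrip {

    /** Serializes the four fields in the same order as TQ.write(). */
    static byte[] write(int year, int month, int day, int wd) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(year);
            out.writeInt(month);
            out.writeInt(day);
            out.writeInt(wd);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the fields back in the same order as TQ.readFields(). */
    static int[] read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int year = in.readInt();
            int month = in.readInt();
            int day = in.readInt();
            int wd = in.readInt();
            return new int[] { year, month, day, wd };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = write(1949, 10, 1, 34);
        // Four ints of four bytes each
        System.out.println(data.length + " bytes -> " + Arrays.toString(read(data)));
    }
}
```

If read() consumed the fields in a different order than write() produced them, the values would silently end up in the wrong fields, since the byte stream carries no field names.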
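3.4 TPartitioner class code
This class defines how the map output is distributed across partitions (and therefore across output files), with the aim of avoiding data skew.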
package com.hadoop.mr.weather;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class TPartitioner extends Partitioner<TQ, IntWritable> {

    // Rule of thumb for avoiding data skew: route the many small keys to a shared reduce task
    // and give each heavy key a reduce task of its own.
    @Override
    public int getPartition(TQ key, IntWritable value, int numPartitions) {
        // TQ does not override hashCode(), so this is Object's identity hash of the key instance
        return key.hashCode() % numPartitions;
    }
}
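3.5 TSortComparator class code:
This class defines the sort comparator: keys are ordered by year and month ascending, then by temperature descending.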
package com.hadoop.mr.weather;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class TSortComparator extends WritableComparator {

    public TSortComparator() {
        // true: let WritableComparator create TQ instances for deserialization
        super(TQ.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        TQ t1 = (TQ) a;
        TQ t2 = (TQ) b;
        // Year ascending
        int c1 = Integer.compare(t1.getYear(), t2.getYear());
        if (c1 != 0) {
            return c1;
        }
        // Month ascending
        int c2 = Integer.compare(t1.getMonth(), t2.getMonth());
        if (c2 != 0) {
            return c2;
        }
        // Same year and month: the minus sign sorts temperatures in descending order
        return -Integer.compare(t1.getWd(), t2.getWd());
    }
}
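The ordering produced by the sort comparator can be checked without a cluster. This is a plain-Java sketch under stated assumptions: SortOrderDemo is a hypothetical class, each record is an int array {year, month, day, temperature}, and the five records are the sample lines quoted in the TMapper comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone demo; each record is {year, month, day, temperature}.
class SortOrderDemo {

    /** Same ordering as TSortComparator: year ascending, month ascending, temperature descending. */
    static int compare(int[] a, int[] b) {
        int c = Integer.compare(a[0], b[0]);
        if (c != 0) return c;
        c = Integer.compare(a[1], b[1]);
        if (c != 0) return c;
        return -Integer.compare(a[3], b[3]);
    }

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<int[]> sorted(List<int[]> records) {
        List<int[]> copy = new ArrayList<>(records);
        copy.sort(SortOrderDemo::compare);
        return copy;
    }

    public static void main(String[] args) {
        List<int[]> records = Arrays.asList(
            new int[] {1949, 10, 1, 34},
            new int[] {1949, 10, 1, 38},
            new int[] {1949, 10, 2, 36},
            new int[] {1950, 1, 1, 32},
            new int[] {1950, 10, 1, 37});
        // Within 1949-10 the temperatures come out as 38, 36, 34
        for (int[] r : sorted(records)) {
            System.out.println(Arrays.toString(r));
        }
    }
}
```

The day is deliberately not part of the ordering: within a month only the temperature decides, so the hottest record of each month always comes first.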
3.6 TGroupingComparator class code:
This class groups records by two dimensions, year and month.
package com.hadoop.mr.weather;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class TGroupingComparator extends WritableComparator {

    public TGroupingComparator() {
        super(TQ.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        TQ t1 = (TQ) a;
        TQ t2 = (TQ) b;
        int c1 = Integer.compare(t1.getYear(), t2.getYear());
        if (c1 == 0) {
            // Same year: the month comparison decides, so one group is one (year, month) pair
            return Integer.compare(t1.getMonth(), t2.getMonth());
        }
        return c1;
    }
}
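A grouping comparator decides where one reduce() call ends and the next begins: adjacent sorted records that compare as equal share a call. The sketch below imitates that in plain Java; GroupingDemo and its methods are hypothetical, and records are int arrays {year, month, day, temperature} that are assumed to be sorted already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone demo; each record is {year, month, day, temperature}.
class GroupingDemo {

    /** Same rule as TGroupingComparator: records are equal when year and month match. */
    static int compare(int[] a, int[] b) {
        int c = Integer.compare(a[0], b[0]);
        return c != 0 ? c : Integer.compare(a[1], b[1]);
    }

    /** Splits an already sorted list into runs of adjacent records that compare as equal. */
    static List<List<int[]>> group(List<int[]> sortedRecords) {
        List<List<int[]>> groups = new ArrayList<>();
        int[] previous = null;
        for (int[] record : sortedRecords) {
            if (previous == null || compare(previous, record) != 0) {
                groups.add(new ArrayList<>()); // a new reduce() call would start here
            }
            groups.get(groups.size() - 1).add(record);
            previous = record;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<int[]> sortedRecords = Arrays.asList(
            new int[] {1949, 10, 1, 38},
            new int[] {1949, 10, 2, 36},
            new int[] {1949, 10, 1, 34},
            new int[] {1950, 1, 1, 32},
            new int[] {1950, 10, 1, 37});
        for (List<int[]> g : group(sortedRecords)) {
            System.out.println(g.get(0)[0] + "-" + g.get(0)[1] + ": " + g.size() + " record(s)");
        }
    }
}
```

With the five sample records this yields three groups: 1949-10 with three records, 1950-1 with one, and 1950-10 with one.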
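3.7 TReducer class code
This class defines the format and content of the output data. For each (year, month) group it writes the hottest record and the hottest record from a different day.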
package com.hadoop.mr.weather;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TReducer extends Reducer<TQ, IntWritable, Text, IntWritable> {

    Text rkey = new Text();
    IntWritable rval = new IntWritable();

    @Override
    protected void reduce(TQ key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Keys that the grouping comparator treats as equal (same year and month) form one group,
        // already sorted by temperature in descending order, for example:
        // 1970 01 01 88  88
        // 1970 01 11 78  78
        // 1970 01 21 68  68
        // 1970 01 01 58  58
        // While the values are iterated, Hadoop updates the fields of key to match the current record.
        int flag = 0; // number of records written so far
        int day = 0;  // day of the first (hottest) record
        for (IntWritable v : values) {
            if (flag == 0) {
                // Format the reduce key as 1970-01-01:88
                rkey.set(key.getYear() + "-" + key.getMonth() + "-" + key.getDay() + ":" + key.getWd());
                // The reduce value is the temperature
                rval.set(key.getWd());
                flag++;
                day = key.getDay();
                context.write(rkey, rval);
            }
            // After the first record, write the first record that falls on a different day, then stop
            if (flag != 0 && day != key.getDay()) {
                // Format the reduce key as 1970-01-01:88
                rkey.set(key.getYear() + "-" + key.getMonth() + "-" + key.getDay() + ":" + key.getWd());
                // The reduce value is the temperature
                rval.set(key.getWd());
                context.write(rkey, rval);
                break;
            }
        }
    }
}
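The selection rule in reduce() (take the hottest record, then the hottest record from a different day) can be sketched in plain Java. TopTwoDaysDemo is a hypothetical class, and the sample group is illustrative data adapted from the comment in the reducer, with an extra same-day record added so that the skip is visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone demo; each record is {year, month, day, temperature}.
class TopTwoDaysDemo {

    /** Formats a record the way TReducer builds its output key, e.g. "1970-1-1:88". */
    static String format(int[] r) {
        return r[0] + "-" + r[1] + "-" + r[2] + ":" + r[3];
    }

    /**
     * Input: one (year, month) group sorted by temperature in descending order.
     * Output: the hottest record plus the hottest record from a different day.
     */
    static List<String> topTwoDays(List<int[]> group) {
        List<String> out = new ArrayList<>();
        int firstDay = -1;
        for (int[] r : group) {
            if (out.isEmpty()) {
                firstDay = r[2];      // remember the day of the hottest record
                out.add(format(r));
            } else if (r[2] != firstDay) {
                out.add(format(r));   // first record from another day
                break;                // two days found, stop iterating
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> group = Arrays.asList(
            new int[] {1970, 1, 1, 88},
            new int[] {1970, 1, 1, 80},   // same day as the hottest record, skipped
            new int[] {1970, 1, 11, 78},
            new int[] {1970, 1, 21, 68});
        System.out.println(topTwoDays(group)); // prints [1970-1-1:88, 1970-1-11:78]
    }
}
```

In the real reducer the day comes from the key object rather than from the values, which works because Hadoop refreshes the key's fields as the value iterator advances.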
4. Running the program
4.1 Export the package as a jar and upload it to the server
4.2 Create the HDFS input directory
hdfs dfs -mkdir -p /data/tq/input
4.3 Upload the test file to the new HDFS directory
[root@node01 ~]# cat tq.txt
1949-10-01 14:21:02 34c
1949-10-01 19:21:02 38c
1949-10-02 14:01:02 36c
1950-01-01 11:21:02 32c
1950-10-01 12:21:02 37c
-- :: 23c
-- :: 41c
-- :: 27c
-- :: 45c
-- :: 46c
-- :: 47c
[root@node01 ~]# hdfs dfs -put tq.txt /data/tq/input
4.4 Run the program on the server
[root@node01 ~]# hadoop jar Myweather.jar com.hadoop.mr.weather.MyWeather
-- ::, INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
-- ::, WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
-- ::, INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/root/.staging/job_1546092355023_0004
-- ::, INFO input.FileInputFormat: Total input files to process :
-- ::, INFO mapreduce.JobSubmitter: number of splits:
-- ::, INFO Configuration.deprecation: yarn.resourcemanager.zk-address is deprecated. Instead, use hadoop.zk.address
-- ::, INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
-- ::, INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1546092355023_0004
-- ::, INFO mapreduce.JobSubmitter: Executing with tokens: []
-- ::, INFO conf.Configuration: resource-types.xml not found
-- ::, INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
-- ::, INFO impl.YarnClientImpl: Submitted application application_1546092355023_0004
-- ::, INFO mapreduce.Job: The url to track the job: http://node04:8088/proxy/application_1546092355023_0004/
-- ::, INFO mapreduce.Job: Running job: job_1546092355023_0004
-- ::, INFO mapreduce.Job: Job job_1546092355023_0004 running in uber mode : false
-- ::, INFO mapreduce.Job: map % reduce %
-- ::, INFO mapreduce.Job: map % reduce %
-- ::, INFO mapreduce.Job: map % reduce %
-- ::, INFO mapreduce.Job: map % reduce %
-- ::, INFO mapreduce.Job: Job job_1546092355023_0004 completed successfully
-- ::, INFO mapreduce.Job: Counters:
(counter groups printed: File System Counters, Job Counters, Map-Reduce Framework, Shuffle Errors, File Input Format Counters, File Output Format Counters; the values are not preserved)
4.5 Fetch the output files generated on HDFS to the local machine
[root@node01 ~]# hdfs dfs -get /data/tq/output/* ./test
4.6 View the output files
[root@node01 test]# ls
part-r-00000 part-r-00001 _SUCCESS
[root@node01 test]# cat part-r-00000
[root@node01 test]# cat part-r-00001
--:
--:
--:
--:
--:
--:
--:
--:
--:
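Partition 0 is empty, while partition 1 holds all of the key/value output defined by the program. This is data skew. The likely cause is the distribution rule in the TPartitioner class above: TQ does not override hashCode(), so key.hashCode() is Object's identity hash, and because TMapper reuses a single TQ instance, every record gets the same hash value and lands in the same partition.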
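One way to spread the records is to derive the partition from the key's fields instead of from hashCode(). The sketch below is a plain-Java illustration of such a rule, not the original author's fix: YearPartitionDemo and its partition method are hypothetical, and partitioning by year is an assumption chosen because it keeps every (year, month) group on a single reducer, which the grouping comparator requires.

```java
// Hypothetical stand-alone demo of a field-based partition rule.
class YearPartitionDemo {

    /** Derives the partition from the year only, so one year never spans two reducers. */
    static int partition(int year, int numPartitions) {
        // Mask the sign bit so the result is never negative, then take the remainder
        return (year & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 2; // matches job.setNumReduceTasks(2)
        for (int year : new int[] {1949, 1950, 1951}) {
            System.out.println(year + " -> partition " + partition(year, numPartitions));
        }
    }
}
```

In the job itself the same expression would go into TPartitioner.getPartition, using key.getYear(). How evenly it balances the load depends on how the records are spread over the years, so it is worth checking against the real data.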
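5. About the Combiner
Because the data set is small, no combiner class was written for this job.
Each map task can produce a large amount of local output. A combiner merges the map-side output before it is sent on, which reduces the amount of data transferred between the map and reduce nodes and improves network I/O performance. It is one of MapReduce's optimizations; its role is described below.
(1) At its most basic, a combiner aggregates keys locally: the map output is sorted by key and the values of each key are iterated, as follows:
map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
(2) A combiner can also act as a local reduce (it is essentially a reducer). In Hadoop's bundled wordcount example, or in a program that finds the maximum value, the combiner and the reducer are exactly the same, as follows:
map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K3, V3)
reduce: (K3, list(V3)) → list(K4, V4)
Note that in Hadoop the combiner's output types must match the map output types, so in the combine step K3 = K2 and V3 = V2.
Without a combiner in wordcount, all of the aggregation is done by the reducers, which is relatively inefficient. With a combiner, each map task that finishes aggregates its own output locally, which speeds things up. In the wordcount example the value is just a count to be summed, so the summation can start as soon as a map finishes, instead of waiting for all maps to finish before the reduce-side summation begins.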
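The effect of a combiner on wordcount can be imitated in a few lines of plain Java. This is only a simulation under stated assumptions: CombinerDemo, combine, and reduce are hypothetical names, each list of words stands for the (word, 1) pairs emitted by one map task, and no Hadoop classes are involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone simulation of wordcount with a combiner.
class CombinerDemo {

    /** Local combine: turns one map task's (word, 1) pairs into (word, localCount) pairs. */
    static Map<String, Integer> combine(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    /** Reduce: adds up the partial counts received from all map tasks. */
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> map1 = Arrays.asList("hello", "hadoop", "hello");
        List<String> map2 = Arrays.asList("hadoop", "hello");
        Map<String, Integer> c1 = combine(map1);
        Map<String, Integer> c2 = combine(map2);
        System.out.println("records sent without combiner: " + (map1.size() + map2.size())); // 5
        System.out.println("records sent with combiner: " + (c1.size() + c2.size()));        // 4
        System.out.println(reduce(Arrays.asList(c1, c2))); // prints {hadoop=2, hello=3}
    }
}
```

Summation is associative and commutative, which is why the same logic can serve as both combiner and reducer here; an operation such as an average would not give the right result if it were reused as a combiner unchanged.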