Learning Hadoop: The First MapReduce Program
Expectations
Use this MapReduce program to get a feel for how a MapReduce job executes, focusing on digging useful information out of the messages the program prints while it runs.
Before Execution
Program Code
The code is lifted almost verbatim from Hadoop: The Definitive Guide. Its purpose is to find the highest temperature recorded in a given year. The relevant code is as follows:
// Assumption: StringUtils here is Apache Commons Lang (org.apache.commons.lang3.StringUtils);
// any isEmpty(String) helper would work equally well.
import org.apache.commons.lang3.StringUtils;

public class NcdcWeather {
private String USAF_station_id;
private String WBAN_station_id;
private String date;
private String time;
private String latitude;
private String longitude;
/** Elevation */
private String elevation;
/** Wind direction */
private String wind_direction;
private String wind_direction_quality_code;
private String sky_ceiling_height;
private String sky_ceiling_height_quality_code;
private String visibility_distance;
private String visibility_distance_quality_code;
private String air_temperature;
private String air_temperature_quality_code;
private String dew_point_temperature;
private String dew_point_temperature_quality_code;
private String atmospheric_pressure;
private String atmospheric_pressure_quality_code;

public NcdcWeather(String rowData) {
if (StringUtils.isEmpty(rowData) || rowData.length() < 105) {
return;
}
USAF_station_id = rowData.substring(4, 10);
WBAN_station_id = rowData.substring(10, 15);
date = rowData.substring(15, 23);
time = rowData.substring(23, 27);
latitude = rowData.substring(28, 34);
longitude = rowData.substring(34, 41);
elevation = rowData.substring(46, 51);
wind_direction = rowData.substring(60, 63);
wind_direction_quality_code = rowData.substring(63, 64);
sky_ceiling_height = rowData.substring(70, 75);
sky_ceiling_height_quality_code = rowData.substring(75, 76);
visibility_distance = rowData.substring(78, 84);
visibility_distance_quality_code = rowData.substring(84, 85);
air_temperature = rowData.substring(87, 92);
air_temperature_quality_code = rowData.substring(92, 93);
dew_point_temperature = rowData.substring(93, 98);
dew_point_temperature_quality_code = rowData.substring(98, 99);
atmospheric_pressure = rowData.substring(99, 104);
atmospheric_pressure_quality_code = rowData.substring(104, 105);
}
public String getUSAF_station_id() {
return USAF_station_id;
}
public void setUSAF_station_id(String USAF_station_id) {
this.USAF_station_id = USAF_station_id;
}
public String getWBAN_station_id() {
return WBAN_station_id;
}
public void setWBAN_station_id(String WBAN_station_id) {
this.WBAN_station_id = WBAN_station_id;
}
public String getDate() {
return date;
}
public void setDate(String date) {
this.date = date;
}
public String getTime() {
return time;
}
public void setTime(String time) {
this.time = time;
}
public String getLatitude() {
return latitude;
}
public void setLatitude(String latitude) {
this.latitude = latitude;
}
public String getLongitude() {
return longitude;
}
public void setLongitude(String longitude) {
this.longitude = longitude;
}
public String getElevation() {
return elevation;
}
public void setElevation(String elevation) {
this.elevation = elevation;
}
public String getWind_direction() {
return wind_direction;
}
public void setWind_direction(String wind_direction) {
this.wind_direction = wind_direction;
}
public String getWind_direction_quality_code() {
return wind_direction_quality_code;
}
public void setWind_direction_quality_code(String wind_direction_quality_code) {
this.wind_direction_quality_code = wind_direction_quality_code;
}
public String getSky_ceiling_height() {
return sky_ceiling_height;
}
public void setSky_ceiling_height(String sky_ceiling_height) {
this.sky_ceiling_height = sky_ceiling_height;
}
public String getSky_ceiling_height_quality_code() {
return sky_ceiling_height_quality_code;
}
public void setSky_ceiling_height_quality_code(String sky_ceiling_height_quality_code) {
this.sky_ceiling_height_quality_code = sky_ceiling_height_quality_code;
}
public String getVisibility_distance() {
return visibility_distance;
}
public void setVisibility_distance(String visibility_distance) {
this.visibility_distance = visibility_distance;
}
public String getVisibility_distance_quality_code() {
return visibility_distance_quality_code;
}
public void setVisibility_distance_quality_code(String visibility_distance_quality_code) {
this.visibility_distance_quality_code = visibility_distance_quality_code;
}
public String getAir_temperature() {
return air_temperature;
}
public void setAir_temperature(String air_temperature) {
this.air_temperature = air_temperature;
}
public String getAir_temperature_quality_code() {
return air_temperature_quality_code;
}
public void setAir_temperature_quality_code(String air_temperature_quality_code) {
this.air_temperature_quality_code = air_temperature_quality_code;
}
public String getDew_point_temperature() {
return dew_point_temperature;
}
public void setDew_point_temperature(String dew_point_temperature) {
this.dew_point_temperature = dew_point_temperature;
}
public String getDew_point_temperature_quality_code() {
return dew_point_temperature_quality_code;
}
public void setDew_point_temperature_quality_code(String dew_point_temperature_quality_code) {
this.dew_point_temperature_quality_code = dew_point_temperature_quality_code;
}
public String getAtmospheric_pressure() {
return atmospheric_pressure;
}
public void setAtmospheric_pressure(String atmospheric_pressure) {
this.atmospheric_pressure = atmospheric_pressure;
}
public String getAtmospheric_pressure_quality_code() {
return atmospheric_pressure_quality_code;
}
public void setAtmospheric_pressure_quality_code(String atmospheric_pressure_quality_code) {
this.atmospheric_pressure_quality_code = atmospheric_pressure_quality_code;
}
@Override
public String toString() {
return "NcdcWeather{" +
"USAF_station_id='" + USAF_station_id + '\'' +
", WBAN_station_id='" + WBAN_station_id + '\'' +
", date='" + date + '\'' +
", time='" + time + '\'' +
", latitude='" + latitude + '\'' +
", longitude='" + longitude + '\'' +
", elevation='" + elevation + '\'' +
", wind_direction='" + wind_direction + '\'' +
", wind_direction_quality_code='" + wind_direction_quality_code + '\'' +
", sky_ceiling_height='" + sky_ceiling_height + '\'' +
", sky_ceiling_height_quality_code='" + sky_ceiling_height_quality_code + '\'' +
", visibility_distance='" + visibility_distance + '\'' +
", visibility_distance_quality_code='" + visibility_distance_quality_code + '\'' +
", air_temperature='" + air_temperature + '\'' +
", air_temperature_quality_code='" + air_temperature_quality_code + '\'' +
", dew_point_temperature='" + dew_point_temperature + '\'' +
", dew_point_temperature_quality_code='" + dew_point_temperature_quality_code + '\'' +
", atmospheric_pressure='" + atmospheric_pressure + '\'' +
", atmospheric_pressure_quality_code='" + atmospheric_pressure_quality_code + '\'' +
'}';
}
}
Weather Bean
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int MISS_CODE = 9999;

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        NcdcWeather ncdcWeather = new NcdcWeather(line);
        String year = ncdcWeather.getDate().substring(0, 4);
        int temperature = 0;
        if (ncdcWeather.getAir_temperature().startsWith("+")) {
            temperature = Integer.parseInt(ncdcWeather.getAir_temperature().substring(1));
        } else {
            temperature = Integer.parseInt(ncdcWeather.getAir_temperature());
        }
        // Only emit (year, temperature) for readings that are present and pass the quality-code check.
        if (temperature != MISS_CODE && ncdcWeather.getAir_temperature_quality_code().matches("[01459]")) {
            context.write(new Text(year), new IntWritable(temperature));
        }
    }
}
Mapper
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable temp : values) {
            max = Math.max(max, temp.get());
        }
        context.write(key, new IntWritable(max));
    }
}
Reducer
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        Job job = Job.getInstance();
        job.setJarByClass(MaxTemperature.class);
        job.setJobName("Max Temperature");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
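The driver above builds the Job directly in main(). As the job log below will warn, Hadoop suggests implementing the Tool interface and launching through ToolRunner so that the standard command-line options are parsed for us. For reference only, a minimal sketch of that style might look like the following; the class name MaxTemperatureDriver is purely illustrative and is not part of the program actually run in this post:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Illustrative ToolRunner-based driver (not the code used for the runs below).
public class MaxTemperatureDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperatureDriver <input path> <output path>");
            return -1;
        }
        // getConf() already carries any generic options parsed by ToolRunner.
        Job job = Job.getInstance(getConf(), "Max Temperature");
        job.setJarByClass(getClass());
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MaxTemperatureDriver(), args));
    }
}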
Data Preparation
Weather data for the years 1901 and 1902 was uploaded to HDFS ahead of time, as shown in the figure below:
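For reference, the upload and the listing can be reproduced with the HDFS shell along these lines; the local file names 1901 and 1902 and the /ncdc/ target directory are assumptions that simply match the input path used in the run command below:
hdfs dfs -mkdir -p /ncdc
hdfs dfs -put 1901 1902 /ncdc/
hdfs dfs -ls /ncdc/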
Trial Run
Package the code above into a jar and push it to our Hadoop environment in the virtual machines. To guard against bugs in the program, and to save execution time, we first verify the code on a small cluster with a small amount of data. For that purpose I prepared a pseudo-distributed cluster in advance (pseudo-distributed meaning a fully distributed setup with only one node). Let's run the program in that environment:
yarn jar ~/max-temperature-1.0-SNAPSHOT-jar-with-dependencies.jar /ncdc/ /max_temperature_out/
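While the job is running (or after it finishes), its state can also be checked from the command line; for example, with the application id that shows up in the log below:
yarn application -status application_1568096329203_0001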
The log printed by a successful run:
-- ::, INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:
-- ::, WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
-- ::, INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoopuser/.staging/job_1568096329203_0001
-- ::, INFO input.FileInputFormat: Total input files to process : 2
-- ::, INFO mapreduce.JobSubmitter: number of splits:2
-- ::, INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
-- ::, INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1568096329203_0001
-- ::, INFO mapreduce.JobSubmitter: Executing with tokens: []
-- ::, INFO conf.Configuration: resource-types.xml not found
-- ::, INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
-- ::, INFO impl.YarnClientImpl: Submitted application application_1568096329203_0001
-- ::, INFO mapreduce.Job: The url to track the job: http://slave1:8088/proxy/application_1568096329203_0001/
-- ::, INFO mapreduce.Job: Running job: job_1568096329203_0001
-- ::, INFO mapreduce.Job: Job job_1568096329203_0001 running in uber mode : false
-- ::, INFO mapreduce.Job: map % reduce %
-- ::, INFO mapreduce.Job: map % reduce %
-- ::, INFO mapreduce.Job: map % reduce %
-- ::, INFO mapreduce.Job: Job job_1568096329203_0001 completed successfully
-- ::, INFO mapreduce.Job: Counters:
File System Counters
FILE: Number of bytes read=
FILE: Number of bytes written=
FILE: Number of read operations=
FILE: Number of large read operations=
FILE: Number of write operations=
HDFS: Number of bytes read=
HDFS: Number of bytes written=
HDFS: Number of read operations=
HDFS: Number of large read operations=
HDFS: Number of write operations=
HDFS: Number of bytes read erasure-coded=
Job Counters
Launched map tasks=
Launched reduce tasks=
Data-local map tasks=
Total time spent by all maps in occupied slots (ms)=
Total time spent by all reduces in occupied slots (ms)=
Total time spent by all map tasks (ms)=
Total time spent by all reduce tasks (ms)=
Total vcore-milliseconds taken by all map tasks=
Total vcore-milliseconds taken by all reduce tasks=
Total megabyte-milliseconds taken by all map tasks=
Total megabyte-milliseconds taken by all reduce tasks=
Map-Reduce Framework
Map input records=
Map output records=
Map output bytes=
Map output materialized bytes=
Input split bytes=
Combine input records=
Combine output records=
Reduce input groups=
Reduce shuffle bytes=
Reduce input records=
Reduce output records=
Spilled Records=
Shuffled Maps =
Failed Shuffles=
Merged Map outputs=
GC time elapsed (ms)=
CPU time spent (ms)=
Physical memory (bytes) snapshot=
Virtual memory (bytes) snapshot=
Total committed heap usage (bytes)=
Peak Map Physical memory (bytes)=
Peak Map Virtual memory (bytes)=
Peak Reduce Physical memory (bytes)=
Peak Reduce Virtual memory (bytes)=
Shuffle Errors
BAD_ID=
CONNECTION=
IO_ERROR=
WRONG_LENGTH=
WRONG_MAP=
WRONG_REDUCE=
File Input Format Counters
Bytes Read=
File Output Format Counters
Bytes Written=
Next, let's work through what the log means, line by line:
-- ::, INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:
# Emm, it first has to find the ResourceManager.
-- ::, WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
# Oh-oh, the way the job is launched in our code is dated; the recommended approach is to implement the Tool interface and run through ToolRunner.
-- ::, INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoopuser/.staging/job_1568096329203_0001
-- ::, INFO input.FileInputFormat: Total input files to process : 2
# Number of input files: 2.
-- ::, INFO mapreduce.JobSubmitter: number of splits:2
# Number of splits: 2 (surprisingly, exactly the same as the number of input blocks~).
-- ::, INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
-- ::, INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1568096329203_0001
# Submitting the job tokens. We finally see the job naming convention: roughly job_<timestamp>_<4-digit incrementing sequence number>.
-- ::, INFO mapreduce.JobSubmitter: Executing with tokens: []
-- ::, INFO conf.Configuration: resource-types.xml not found
-- ::, INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
-- ::, INFO impl.YarnClientImpl: Submitted application application_1568096329203_0001
# It looks like the application has been submitted to YARN.
-- ::, INFO mapreduce.Job: The url to track the job: http://slave1:8088/proxy/application_1568096329203_0001/
# Em, a URL is provided for tracking the job's progress.
-- ::, INFO mapreduce.Job: Running job: job_1568096329203_0001
-- ::, INFO mapreduce.Job: Job job_1568096329203_0001 running in uber mode : false
# The job does not meet the conditions for an uber job, so it runs in non-uber mode.
-- ::, INFO mapreduce.Job: map % reduce %
-- ::, INFO mapreduce.Job: map % reduce %
-- ::, INFO mapreduce.Job: map % reduce %
-- ::, INFO mapreduce.Job: Job job_1568096329203_0001 completed successfully
# Em, the job finished successfully.
-- ::, INFO mapreduce.Job: Counters:
# Below are some of the job's counters.
File System Counters
FILE: Number of bytes read=
FILE: Number of bytes written=
FILE: Number of read operations=
FILE: Number of large read operations=
FILE: Number of write operations=
HDFS: Number of bytes read=
HDFS: Number of bytes written=
HDFS: Number of read operations=
HDFS: Number of large read operations=
HDFS: Number of write operations=
HDFS: Number of bytes read erasure-coded=
Job Counters
Launched map tasks=
Launched reduce tasks=
Data-local map tasks=
Total time spent by all maps in occupied slots (ms)=
Total time spent by all reduces in occupied slots (ms)=
Total time spent by all map tasks (ms)=
Total time spent by all reduce tasks (ms)=
Total vcore-milliseconds taken by all map tasks=
Total vcore-milliseconds taken by all reduce tasks=
Total megabyte-milliseconds taken by all map tasks=
Total megabyte-milliseconds taken by all reduce tasks=
Map-Reduce Framework
Map input records=
Map output records=
Map output bytes=
Map output materialized bytes=
Input split bytes=
Combine input records=
Combine output records=
Reduce input groups=
Reduce shuffle bytes=
Reduce input records=
Reduce output records=
Spilled Records=
Shuffled Maps =
Failed Shuffles=
Merged Map outputs=
GC time elapsed (ms)=
CPU time spent (ms)=
Physical memory (bytes) snapshot=
Virtual memory (bytes) snapshot=
Total committed heap usage (bytes)=
Peak Map Physical memory (bytes)=
Peak Map Virtual memory (bytes)=
Peak Reduce Physical memory (bytes)=
Peak Reduce Virtual memory (bytes)=
Shuffle Errors
BAD_ID=
CONNECTION=
IO_ERROR=
WRONG_LENGTH=
WRONG_MAP=
WRONG_REDUCE=
File Input Format Counters
Bytes Read=
File Output Format Counters
Bytes Written=
Now let's see what actually ended up in the output directory we specified:
As you can see, the output directory contains two files: an empty file named _SUCCESS, which marks the job as having finished successfully, and a file named part-r-00000, which holds the output of the reduce task. Looking at the contents of part-r-00000:
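Both checks can be done with the HDFS shell; for example, using the output path from this run:
hdfs dfs -ls /max_temperature_out/
hdfs dfs -cat /max_temperature_out/part-r-00000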
As hoped, we get the maximum temperatures for 1901 and 1902: 317 and 244 respectively (the NCDC data stores temperatures in tenths of a degree Celsius, so these correspond to 31.7°C and 24.4°C).
We could compute the maximum temperature by some non-MapReduce means to verify that this MapReduce job calculates it correctly, but since that is not our focus here, and the program logic is quite simple, we will take the result on trust for now.
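For reference, a rough spot check without MapReduce could be run with shell tools on a locally downloaded, uncompressed copy of one year's records. The column positions follow the parser above (the signed temperature in columns 88-92, its quality code in column 93); the local file name 1901 is just an assumption:
cut -c 88-93 1901 | awk 'BEGIN { max = -99999 }
  { t = substr($0, 1, 5) + 0; q = substr($0, 6, 1);
    if (t != 9999 && q ~ /[01459]/ && t > max) max = t }
  END { print max }'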
Conjectures
Based on the input, the output, and the log messages printed while the job ran, we make the following conjectures:
Conjecture 1: the number of splits equals the number of input blocks.
Conjecture 2: the File Input Format Counters value equals the total number of bytes read from the input files.
Conjecture 3: the File Output Format Counters value equals the total number of bytes written to the output directory.
Verifying the Conjectures
To verify the conjectures, we run the same code on the fully distributed cluster we set up. The input files prepared in that environment are as follows:
In other words, the cluster holds data files covering 44 years, 3137231401 bytes in total, stored as 61 blocks.
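These numbers come straight from HDFS; a listing along these lines can be reproduced with fsck (the path matches the one used in the job below):
hdfs fsck /ncdc/raw/ -files -blocks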
Now we submit the job:
yarn jar ~/max-temperature-1.0-SNAPSHOT-jar-with-dependencies.jar /ncdc/raw/ /max_temperature_out/
Submitting and running the job was not entirely smooth sailing; quite a few problems came up along the way, the most typical being java.net.NoRouteToHostException. How that error was resolved is recorded as Problem 1 in 《Hadoop学习问题记录之基础篇》.
After working through those problems, a successful run prints the following log:
[hadoop_user@master hadoop-3.2.]$ yarn jar ~/max-temperature-1.0-SNAPSHOT-jar-with-dependencies.jar /ncdc/raw/ /max_out
-- ::, INFO client.RMProxy: Connecting to ResourceManager at master/192.168.212.132:
-- ::, WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
-- ::, INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop_user/.staging/job_1568605271327_0002
-- ::, INFO input.FileInputFormat: Total input files to process :
-- ::, INFO mapreduce.JobSubmitter: number of splits:58
-- ::, INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
-- ::, INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1568605271327_0002
-- ::, INFO mapreduce.JobSubmitter: Executing with tokens: []
-- ::, INFO conf.Configuration: resource-types.xml not found
-- ::, INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
-- ::, INFO impl.YarnClientImpl: Submitted application application_1568605271327_0002
-- ::, INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1568605271327_0002/
-- ::, INFO mapreduce.Job: Running job: job_1568605271327_0002
-- ::, INFO mapreduce.Job: Job job_1568605271327_0002 running in uber mode : false
-- ::, INFO mapreduce.Job: map % reduce %
... (the progress lines repeat until map and reduce both reach 100%) ...
-- ::, INFO mapreduce.Job: map % reduce %
-- ::, INFO mapreduce.Job: Job job_1568605271327_0002 completed successfully
-- ::, INFO mapreduce.Job: Counters:
File System Counters
FILE: Number of bytes read=
FILE: Number of bytes written=
FILE: Number of read operations=
FILE: Number of large read operations=
FILE: Number of write operations=
HDFS: Number of bytes read=
HDFS: Number of bytes written=
HDFS: Number of read operations=
HDFS: Number of large read operations=
HDFS: Number of write operations=
HDFS: Number of bytes read erasure-coded=
Job Counters
Killed map tasks=
Launched map tasks=
Launched reduce tasks=
Data-local map tasks=
Rack-local map tasks=
Total time spent by all maps in occupied slots (ms)=
Total time spent by all reduces in occupied slots (ms)=
Total time spent by all map tasks (ms)=
Total time spent by all reduce tasks (ms)=
Total vcore-milliseconds taken by all map tasks=
Total vcore-milliseconds taken by all reduce tasks=
Total megabyte-milliseconds taken by all map tasks=
Total megabyte-milliseconds taken by all reduce tasks=
Map-Reduce Framework
Map input records=
Map output records=
Map output bytes=
Map output materialized bytes=
Input split bytes=
Combine input records=
Combine output records=
Reduce input groups=
Reduce shuffle bytes=
Reduce input records=
Reduce output records=
Spilled Records=
Shuffled Maps =
Failed Shuffles=
Merged Map outputs=
GC time elapsed (ms)=
CPU time spent (ms)=
Physical memory (bytes) snapshot=
Virtual memory (bytes) snapshot=
Total committed heap usage (bytes)=
Peak Map Physical memory (bytes)=
Peak Map Virtual memory (bytes)=
Peak Reduce Physical memory (bytes)=
Peak Reduce Virtual memory (bytes)=
Shuffle Errors
BAD_ID=
CONNECTION=
IO_ERROR=
WRONG_LENGTH=
WRONG_MAP=
WRONG_REDUCE=
File Input Format Counters
Bytes Read=
File Output Format Counters
Bytes Written=
From this log we can conclude:
Conjecture 1 is wrong: the number of input splits is 58, which is not equal to the 61 blocks the input files occupy.
Conjecture 2 is wrong: File Input Format Counters is not always equal to the actual input size in bytes; here the job counted 3137288745 bytes read, slightly more than the 3137231401 bytes reported by the fsck tool.
Conjecture 3 is correct: File Output Format Counters equals the total size in bytes of the part-r-* files actually written.
Conclusions
- The number of splits is not necessarily equal to the number of input blocks.
- File Input Format Counters is not always equal to the total input size in bytes reported by the fsck tool.
- File Output Format Counters equals the total size in bytes of all the part-r-* files actually written.
PS 1: the conclusions above were drawn only for text input; whether other kinds of InputFormat break them is left as an open question for now.
PS 2: how the number of splits is determined deserves its own write-up, so it is not covered further here. See here.