References

  • Hadoop: The Definitive Guide, Chapter 6, Section 6.4

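Background

Hadoop and MapReduce, much like MVC and Spring, are everywhere these days. I have used them, but have I read the source? No. Have I tuned the parameters? Yes, just enough to get jobs to run. Now that I have time to read Hadoop: The Definitive Guide, I realize how many detours I took.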

MR flow

Parameters

Affecting both sides

io.sort.factor

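The maximum number of inputs merged at once in a multi-way merge. A larger value reduces the number of merge rounds and therefore the number of disk reads and writes.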
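As a rough illustration of why a larger fan-in helps, the sketch below counts merge rounds under a simplified model in which every round merges groups of at most `factor` sorted runs into one. The class and method names are hypothetical, and Hadoop's own merger schedules its rounds somewhat differently, so the counts are estimates only. The value 65 is the Shuffled Maps counter reported in the experiment later in this post.

```java
// Simplified model (an assumption, not Hadoop's actual merge scheduling):
// every round merges groups of up to `factor` sorted runs into one run.
final class MergeRounds {
    /** Number of rounds needed until a single sorted run remains. */
    static int rounds(int runs, int factor) {
        if (runs < 1 || factor < 2) {
            throw new IllegalArgumentException("need runs >= 1 and factor >= 2");
        }
        int rounds = 0;
        while (runs > 1) {
            runs = (runs + factor - 1) / factor; // ceil(runs / factor)
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        System.out.println(rounds(65, 10));  // 65 -> 7 -> 1
        System.out.println(rounds(65, 100)); // 65 -> 1
    }
}
```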

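Map side

io.sort.mb

The size of the map-side output buffer. Map output is first placed here, then sorted and partitioned and written to local disk. The spill files are merged again until the map task finishes, after which the reduce side fetches the data.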

io.sort.spill.percent

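The fraction of the output buffer that must be filled with map output before it starts being flushed to disk. The right value should depend on the ratio between how fast the map produces output and how fast the disk absorbs writes; it is an ordinary bounded-buffer, producer-consumer problem.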
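The arithmetic behind this threshold can be sketched as follows. The helper name is hypothetical, and the inputs in `main` are assumptions: a 100 MB buffer and a fraction of 0.80, which as far as I know are Hadoop's documented defaults.

```java
// Sketch of the spill-start arithmetic; names are illustrative only.
final class SpillThreshold {
    /** Bytes of buffered map output at which a background spill starts. */
    static long spillStartBytes(int ioSortMb, double spillPercent) {
        long bufferBytes = (long) ioSortMb * 1024 * 1024;
        return Math.round(bufferBytes * spillPercent);
    }

    public static void main(String[] args) {
        // Assumed inputs: io.sort.mb = 100, io.sort.spill.percent = 0.80
        System.out.println(spillStartBytes(100, 0.80) + " bytes");
    }
}
```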

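Reduce side

mapred.job.shuffle.input.buffer.percent

The fraction of the JVM heap used as the reduce-side input buffer. If all the data fetched from the map side fits in it, the map outputs can be merged entirely in memory and fed straight to reduce without being written to disk.

Beware, this is a pitfall: when the available JVM heap exceeds Integer.MAX_VALUE bytes (about 2 GB), Integer.MAX_VALUE is silently used in its place as the base for this fraction. In the last experiment below, a 5000 MB heap with a fraction of 0.9 yields a limit of only about 1.9 GB. The mapreduce.reduce.memory.totalbytes parameter has to be set explicitly to define the maximum usable heap size.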

mapred.job.shuffle.merge.percent

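The fraction of the reduce-side input buffer that must be in use before merging to disk starts. That is, once the data received from the map side exceeds heapsize * mapred.job.shuffle.input.buffer.percent * mapred.job.shuffle.merge.percent, merged output starts being written to local disk.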
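The product above, including the 2 GB cap described for the input buffer, can be written out as a small sketch. It illustrates the formula as stated in this post and is not the framework's own code; the class and method names are hypothetical, and the 1 GiB heap with fractions 0.70 and 0.66 in `main` are example inputs only.

```java
// Illustration of heapsize * input.buffer.percent * merge.percent,
// with the heap capped at Integer.MAX_VALUE as described in the text.
final class ShuffleBuffer {
    static long inputBufferBytes(long heapBytes, double inputBufferPercent) {
        long capped = Math.min(heapBytes, Integer.MAX_VALUE);
        return (long) (capped * inputBufferPercent);
    }

    static long mergeTriggerBytes(long heapBytes, double inputBufferPercent,
                                  double mergePercent) {
        return (long) (inputBufferBytes(heapBytes, inputBufferPercent) * mergePercent);
    }

    public static void main(String[] args) {
        long heap = 1L << 30; // example: 1 GiB heap
        System.out.println("input buffer  = " + inputBufferBytes(heap, 0.70));
        System.out.println("merge trigger = " + mergeTriggerBytes(heap, 0.70, 0.66));
    }
}
```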

mapred.inmem.merge.threshold

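The number of map outputs accumulated in the reduce-side input buffer at which merging to disk starts. This is an absolute count rather than a fraction (it is a number of map outputs, not a size in MB). If it is set to 0, the count threshold is disabled and merging is controlled by the fraction-based parameters alone.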

mapred.job.reduce.input.buffer.percent

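The fraction of the heap that may still hold map output in the input buffer when the reduce phase starts. The input to the final reduce pass does not have to come entirely from disk: through the multi-way merge it can read from disk and from memory (the in-memory part being map output that has been merged but not yet written to disk). Setting this to 0 forces the buffer to be emptied and all merge results to be written to disk.

This one has a pitfall too: as above, the resulting size never exceeds 2 GB, and here there is no extra parameter to correct it.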

Experiment

Sort 4 GB of integers (4 GB being the size of the files holding the numbers, which are stored as text).

Environment

Hadoop 2.6.0
1 Namenode + 1 ResourceManager + 3 DataNode&NodeManager

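Preparation

The following command was used to produce 4 text files of random integers. Each invocation appends a single random 32-bit integer to out.data, so it has to be run in a loop: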

echo $(od -An -N4 -i /dev/urandom) >> out.data

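Each generated file is about 1 GB. Because the command uses the Linux random number generator, producing the data is rather slow. It can be done on four machines in parallel, with the resulting files then uploaded to the sort_integer folder in HDFS (under the current user's home directory):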

ubuntu@dev00:~/sort-mr$ hadoop fs -ls sort_integer
Found 4 items
-rw-r--r-- 1 ubuntu supergroup 1043473519 2015-08-06 13:05 sort_integer/int00.data
-rw-r--r-- 1 ubuntu supergroup 1075257196 2015-08-06 13:07 sort_integer/int01.data
-rw-r--r-- 1 ubuntu supergroup 1063854482 2015-08-06 13:08 sort_integer/int02.data
-rw-r--r-- 1 ubuntu supergroup 1086774112 2015-08-06 13:08 sort_integer/int03.data
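If the od-based command is too slow, a small generator along the following lines might be a workable substitute. Note that it swaps /dev/urandom for java.util.Random, a pseudo-random generator, which should be acceptable for a sorting benchmark but is not what the original data was produced with. The class name and the command-line arguments are hypothetical.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Random;

// Sketch: writes random 32-bit integers as text, one per line.
final class RandomIntFile {
    static void write(PrintWriter out, long count, Random rnd) {
        for (long i = 0; i < count; i++) {
            out.print(rnd.nextInt());
            out.print('\n');
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java RandomIntFile.java out.data 1000000
        long count = Long.parseLong(args[1]);
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(Paths.get(args[0]), StandardCharsets.UTF_8))) {
            write(out, count, new Random());
        }
    }
}
```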

MapReduce program

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;

class SortMapper extends Mapper<Object, Text, LongWritable, IntWritable> {
    private final LongWritable num = new LongWritable(0);
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Each input line holds one integer; emit it as the key so the shuffle sorts it.
        num.set(Long.parseLong(value.toString().trim()));
        context.write(num, one);
    }
}

class SortReducer extends Reducer<LongWritable, IntWritable, LongWritable, NullWritable> {
    @Override
    protected void reduce(LongWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Count how often this key occurred, then write it once per occurrence.
        int count = 0;
        for (IntWritable i : values) {
            count += i.get();
        }
        for (int i = 0; i < count; i++) {
            context.write(key, NullWritable.get());
        }
    }
}

public class SortMR {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("sort <input file/dir> <output dir>");
            return;
        }

        Configuration conf = new Configuration();
        // The tuning parameters of each experiment case are set on conf here.

        Job job = Job.getInstance(conf, "sort-int");
        job.setJarByClass(SortMR.class);
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);

        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(NullWritable.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // A single reducer yields one globally sorted output file.
        job.setNumReduceTasks(1);
        job.waitForCompletion(true);
    }
}
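The job relies on the shuffle to sort the keys; the reducer only replays each key as many times as it occurred. A Hadoop-free sketch of the same logic, using a TreeMap as a stand-in for the sorted grouping that the framework performs, may make this easier to see. The class name is hypothetical and the sketch is only meant for checking the logic on small inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local stand-in for the job: TreeMap plays the role of the sorted shuffle.
final class LocalSortSketch {
    static List<Long> sort(List<String> lines) {
        // "Map" plus shuffle: parse each line and group equal keys in key order.
        TreeMap<Long, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            counts.merge(Long.parseLong(line.trim()), 1, Integer::sum);
        }
        // "Reduce": emit each key once per occurrence.
        List<Long> out = new ArrayList<>();
        for (Map.Entry<Long, Integer> e : counts.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sort(Arrays.asList("3", " 1", "3", "-2"))); // [-2, 1, 3, 3]
    }
}
```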

Experiment cases

Running with the default configuration

15/08/07 04:38:03 INFO mapreduce.Job: Running job: job_1438916755596_0002
15/08/07 04:38:10 INFO mapreduce.Job: Job job_1438916755596_0002 running in uber mode : false
15/08/07 04:38:10 INFO mapreduce.Job: map 0% reduce 0%
15/08/07 04:38:26 INFO mapreduce.Job: map 3% reduce 0%
15/08/07 04:38:27 INFO mapreduce.Job: map 4% reduce 0%
...
15/08/07 04:39:14 INFO mapreduce.Job: map 33% reduce 0%
15/08/07 04:39:15 INFO mapreduce.Job: map 34% reduce 0%
15/08/07 04:39:17 INFO mapreduce.Job: map 36% reduce 0%
15/08/07 04:39:18 INFO mapreduce.Job: map 36% reduce 4%
15/08/07 04:39:19 INFO mapreduce.Job: map 37% reduce 4%
15/08/07 04:39:21 INFO mapreduce.Job: map 37% reduce 5%
15/08/07 04:39:24 INFO mapreduce.Job: map 37% reduce 7%
15/08/07 04:39:28 INFO mapreduce.Job: map 38% reduce 9%
...
15/08/07 04:41:06 INFO mapreduce.Job: map 92% reduce 25%
15/08/07 04:41:08 INFO mapreduce.Job: map 93% reduce 26%
15/08/07 04:41:09 INFO mapreduce.Job: map 94% reduce 26%
15/08/07 04:41:14 INFO mapreduce.Job: map 95% reduce 27%
15/08/07 04:41:17 INFO mapreduce.Job: map 96% reduce 28%
15/08/07 04:41:19 INFO mapreduce.Job: map 97% reduce 28%
15/08/07 04:41:21 INFO mapreduce.Job: map 98% reduce 28%
15/08/07 04:41:23 INFO mapreduce.Job: map 99% reduce 29%
15/08/07 04:41:26 INFO mapreduce.Job: map 100% reduce 31%
15/08/07 04:41:29 INFO mapreduce.Job: map 100% reduce 32%
15/08/07 04:41:32 INFO mapreduce.Job: map 100% reduce 33%
15/08/07 04:46:22 INFO mapreduce.Job: map 100% reduce 34%
15/08/07 04:46:25 INFO mapreduce.Job: map 100% reduce 35%
15/08/07 04:46:28 INFO mapreduce.Job: map 100% reduce 36%
...
15/08/07 04:56:56 INFO mapreduce.Job: map 100% reduce 98%
15/08/07 04:57:14 INFO mapreduce.Job: map 100% reduce 99%
15/08/07 04:57:32 INFO mapreduce.Job: map 100% reduce 100%
15/08/07 04:57:42 INFO mapreduce.Job: Job job_1438916755596_0002 completed successfully

The whole job took about 20 minutes. The mappers all finished within about 3 minutes; the reducer took much longer.

Job Counters
Killed map tasks=1
Launched map tasks=66
Launched reduce tasks=1
Data-local map tasks=51
Rack-local map tasks=15
Total time spent by all maps in occupied slots (ms)=3658033
Total time spent by all reduces in occupied slots (ms)=1116041
Total time spent by all map tasks (ms)=3658033
Total time spent by all reduce tasks (ms)=1116041
Total vcore-seconds taken by all map tasks=3658033
Total vcore-seconds taken by all reduce tasks=1116041
Total megabyte-seconds taken by all map tasks=3745825792
Total megabyte-seconds taken by all reduce tasks=1142825984
Map-Reduce Framework
Map input records=388738323
Map output records=388738323
Map output bytes=4664859876
Map output materialized bytes=5442336912
Input split bytes=7670
Combine input records=0
Combine output records=0
Reduce input groups=371667599
Reduce shuffle bytes=5442336912
Reduce input records=388738323
Reduce output records=388738323
Spilled Records=1527999455
Shuffled Maps =65
Failed Shuffles=0
Merged Map outputs=65
GC time elapsed (ms)=59949
CPU time spent (ms)=2838000
Physical memory (bytes) snapshot=17842167808
Virtual memory (bytes) snapshot=46181015552
Total committed heap usage (bytes)=13626769408

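The counters show 388738323 map input records and 388738323 reduce output records. The two match, so at least the record count is right.

Increasing the merge fan-in

That is, change io.sort.factor, the number of inputs merged at the same time, to cut down how often data is merged, written to disk and read back. The larger the factor, the fewer merge rounds are needed.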

conf.setInt("io.sort.factor", 100);

Following the example in the book, it is first set to 100.

15/08/07 08:29:31 INFO mapreduce.Job: Running job: job_1438916755596_0005
15/08/07 08:29:38 INFO mapreduce.Job: Job job_1438916755596_0005 running in uber mode : false
15/08/07 08:29:38 INFO mapreduce.Job: map 0% reduce 0%
15/08/07 08:29:54 INFO mapreduce.Job: map 2% reduce 0%
...
15/08/07 08:30:42 INFO mapreduce.Job: map 34% reduce 0%
15/08/07 08:30:43 INFO mapreduce.Job: map 35% reduce 4%
15/08/07 08:30:46 INFO mapreduce.Job: map 36% reduce 5%
15/08/07 08:30:49 INFO mapreduce.Job: map 37% reduce 7%
15/08/07 08:30:52 INFO mapreduce.Job: map 38% reduce 9%
15/08/07 08:30:55 INFO mapreduce.Job: map 38% reduce 11%
15/08/07 08:30:57 INFO mapreduce.Job: map 39% reduce 11%
15/08/07 08:30:59 INFO mapreduce.Job: map 41% reduce 11%
...
15/08/07 08:32:51 INFO mapreduce.Job: map 98% reduce 27%
15/08/07 08:32:53 INFO mapreduce.Job: map 99% reduce 27%
15/08/07 08:32:54 INFO mapreduce.Job: map 100% reduce 28%
15/08/07 08:32:57 INFO mapreduce.Job: map 100% reduce 30%
15/08/07 08:32:59 INFO mapreduce.Job: map 100% reduce 32%
15/08/07 08:33:02 INFO mapreduce.Job: map 100% reduce 44%
15/08/07 08:33:05 INFO mapreduce.Job: map 100% reduce 67%
15/08/07 08:33:21 INFO mapreduce.Job: map 100% reduce 68%
15/08/07 08:33:39 INFO mapreduce.Job: map 100% reduce 69%
...
15/08/07 08:43:03 INFO mapreduce.Job: map 100% reduce 99%
15/08/07 08:43:18 INFO mapreduce.Job: map 100% reduce 100%
15/08/07 08:43:27 INFO mapreduce.Job: Job job_1438916755596_0005 completed successfully

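The job took 14 min in total. The map phase took roughly as long as before. The data totals about 4 GB and 66 map tasks were launched (one of them was killed, so 65 produced output), which means each mapper was assigned a little over 60 MB. The mapper output should not grow much either, since the input is numbers in text form and the intermediate output is LongWritable. With the default io.sort.mb of 100 MB, data of this size should be sortable entirely in memory with no external merge, so io.sort.factor should have no effect on the map phase.

The reducer time, however, dropped noticeably. The main load of this job is on the reducer, which has to merge the data produced by the mappers, that is, 65 map outputs (Shuffled Maps = 65). With the original io.sort.factor of 10 this cannot be done in one go, and at least two merge rounds are needed (roughly 65 → 7 in the first round and 7 → 1 in the second). With the parameter raised to 100, a single round is enough.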

Job Counters
Killed map tasks=1
Launched map tasks=66
Launched reduce tasks=1
Data-local map tasks=57
Rack-local map tasks=9
Total time spent by all maps in occupied slots (ms)=3727216
Total time spent by all reduces in occupied slots (ms)=775881
Total time spent by all map tasks (ms)=3727216
Total time spent by all reduce tasks (ms)=775881
Total vcore-seconds taken by all map tasks=3727216
Total vcore-seconds taken by all reduce tasks=775881
Total megabyte-seconds taken by all map tasks=3816669184
Total megabyte-seconds taken by all reduce tasks=794502144
Map-Reduce Framework
Map input records=388738323
Map output records=388738323
Map output bytes=4664859876
Map output materialized bytes=5442336912
Input split bytes=7670
Combine input records=0
Combine output records=0
Reduce input groups=371667599
Reduce shuffle bytes=5442336912
Reduce input records=388738323
Reduce output records=388738323
Spilled Records=1165028413
Shuffled Maps =65
Failed Shuffles=0
Merged Map outputs=65
GC time elapsed (ms)=56981
CPU time spent (ms)=2405160
Physical memory (bytes) snapshot=17848528896
Virtual memory (bytes) snapshot=46269399040
Total committed heap usage (bytes)=13616283648

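The counters show that the reducer needed 1116 s (about 18.6 min) with the default parameters and only 776 s (about 13 min) after adjusting io.sort.factor. That is a considerable improvement, a reduction of about 30%. Since we estimated that two merge rounds were needed before and only one now, with all other parameters unchanged, one merge round can be inferred to take about (1116 - 776) s = 340 s.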
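The comparison can be reproduced directly from the two "Total time spent by all reduce tasks (ms)" counters. The sketch below is only that arithmetic; the class and method names are hypothetical.

```java
// Arithmetic on the two reduce-time counters quoted above.
final class ReduceTimeDelta {
    /** Percentage by which afterMs is smaller than beforeMs. */
    static double reductionPercent(long beforeMs, long afterMs) {
        return 100.0 * (beforeMs - afterMs) / beforeMs;
    }

    public static void main(String[] args) {
        long before = 1116041L; // default configuration
        long after = 775881L;   // io.sort.factor = 100
        System.out.printf("saved %d ms, %.1f%% less reduce time%n",
                before - after, reductionPercent(before, after));
    }
}
```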

Spilled Records

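By the standard definition, this is the number of records spilled to disk by the mappers and reducers while they work. Spilling means writing records from some buffer out to disk, for example from the sort buffer after map, or from the merge buffer on the reducer side. With the default configuration, (Spilled Records) / (Reduce output records) is about 3.93, roughly 4; after adjusting io.sort.factor the ratio is 2.99, roughly 3. In other words, the number of records written to disk along the way dropped by one times the number of output records, which is exactly what one merge round writes. This also suggests that the parameter change did indeed remove one merge round.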
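The two ratios quoted above come straight from the counters of the first two runs; a minimal sketch of the calculation, with a hypothetical class name:

```java
// Ratio of spilled records to output records for the first two runs.
final class SpillRatio {
    static double ratio(long spilledRecords, long reduceOutputRecords) {
        return (double) spilledRecords / reduceOutputRecords;
    }

    public static void main(String[] args) {
        long output = 388738323L; // Reduce output records (same in both runs)
        System.out.printf("default config:       %.3f%n", ratio(1527999455L, output));
        System.out.printf("io.sort.factor = 100: %.3f%n", ratio(1165028413L, output));
    }
}
```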

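There is a problem, though: the number of spilled records is three times the number of integers being sorted. By the earlier analysis, if the map-side memory can hold all of a mapper's output, the map phase performs only one complete spill, whose total equals the number of records. The reducer side, by the same analysis, needs only one merge, so the ratio should be 2. I do not know which other writes are being counted in Spilled Records.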
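One possible explanation, offered as an assumption rather than something these measurements prove: the map-side sort buffer holds not only the serialized records but also per-record bookkeeping. The sketch below assumes 16 bytes of such metadata per record, which is my understanding of the Hadoop 2.x map output buffer and should be checked against the source. The 65 mappers are taken from the Shuffled Maps counter, and the 80 MB spill point assumes the default 100 MB buffer with a spill fraction of 0.80. All names are hypothetical.

```java
// Rough estimate of sort-buffer demand per mapper.
final class SortBufferEstimate {
    // ASSUMPTION: 16 bytes of bookkeeping per record in the sort buffer.
    static final int ASSUMED_META_BYTES_PER_RECORD = 16;

    static long bytesPerMapper(long mapOutputBytes, long mapOutputRecords, int mappers) {
        long data = mapOutputBytes / mappers;
        long meta = (mapOutputRecords / mappers) * ASSUMED_META_BYTES_PER_RECORD;
        return data + meta;
    }

    public static void main(String[] args) {
        long need = bytesPerMapper(4664859876L, 388738323L, 65);
        long spillStart = Math.round(100L * 1024 * 1024 * 0.80); // assumed defaults
        System.out.println("estimated buffer demand per mapper: " + need);
        System.out.println("spill starts at:                    " + spillStart);
        System.out.println("more than one spill likely:         " + (need > spillStart));
    }
}
```

If the assumption holds, each mapper needs roughly 167 MB of buffer against a spill point of about 84 MB, so it would spill more than once and then merge its spill files, writing every record a second time on the map side. That would account for the extra factor of one, but it remains a hypothesis until checked against the map task logs.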

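Increasing the mapper-side sort memory

That is, change the value of io.sort.mb. Since the mappers are already fairly quick, this parameter is not expected to make much difference. Raising the buffer size on its own usually fails with an OutOfMemory error, because the default configuration gives the JVM a maximum heap of 200 MB. The JVM memory limits therefore have to be changed at the same time, namely mapred.child.java.opts and mapreduce.map.memory.mb. The former carries the heap-size options passed to the JVM; the latter describes roughly how much memory the whole JVM will occupy (including processes it creates), so the latter must be larger than the former. The three should satisfy io.sort.mb < heap size in mapred.child.java.opts < mapreduce.map.memory.mb. The code that adjusts the parameters: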

conf.setInt("io.sort.mb", 500); // set io.sort.mb = 500MB
conf.set("mapred.child.java.opts", "-Xmx800m"); // JVM HEAP = 800MB, default = 200MB

The map output buffer is set to 500 MB, and the heap of the map-side JVM is increased accordingly.
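The ordering constraint on the three settings can be checked before submitting a job. The following helper is a hypothetical sketch that only understands -Xmx values given in megabytes or gigabytes; the 1024 MB container size in `main` is inferred from the job counters (megabyte-seconds divided by vcore-seconds) rather than set explicitly in the code above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sanity check for io.sort.mb < JVM heap < container memory.
final class MapMemoryCheck {
    private static final Pattern XMX = Pattern.compile("-Xmx(\\d+)([mMgG])");

    /** Parses a value such as "-Xmx800m" and returns the heap size in MB. */
    static int xmxMegabytes(String javaOpts) {
        Matcher m = XMX.matcher(javaOpts);
        if (!m.find()) {
            throw new IllegalArgumentException("no -Xmx<N>m or -Xmx<N>g in: " + javaOpts);
        }
        int n = Integer.parseInt(m.group(1));
        return m.group(2).equalsIgnoreCase("g") ? n * 1024 : n;
    }

    static boolean consistent(int ioSortMb, String javaOpts, int containerMb) {
        int heapMb = xmxMegabytes(javaOpts);
        return ioSortMb < heapMb && heapMb < containerMb;
    }

    public static void main(String[] args) {
        System.out.println(consistent(500, "-Xmx800m", 1024)); // the settings used here
        System.out.println(consistent(500, "-Xmx200m", 1024)); // buffer larger than heap
    }
}
```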

15/08/07 10:50:15 INFO mapreduce.Job: Job job_1438916755596_0008 running in uber mode : false
15/08/07 10:50:15 INFO mapreduce.Job: map 0% reduce 0%
...
15/08/07 10:51:15 INFO mapreduce.Job: map 34% reduce 0%
15/08/07 10:51:16 INFO mapreduce.Job: map 37% reduce 4%
...
15/08/07 10:53:11 INFO mapreduce.Job: map 99% reduce 15%
15/08/07 10:53:12 INFO mapreduce.Job: map 100% reduce 15%
...
15/08/07 11:07:17 INFO mapreduce.Job: map 100% reduce 100%
15/08/07 11:07:26 INFO mapreduce.Job: Job job_1438916755596_0008 completed successfully
Job Counters
Killed map tasks=1
Launched map tasks=66
Launched reduce tasks=1
Data-local map tasks=57
Rack-local map tasks=9
Total time spent by all maps in occupied slots (ms)=3330624
Total time spent by all reduces in occupied slots (ms)=981770
Total time spent by all map tasks (ms)=3330624
Total time spent by all reduce tasks (ms)=981770
Total vcore-seconds taken by all map tasks=3330624
Total vcore-seconds taken by all reduce tasks=981770
Total megabyte-seconds taken by all map tasks=3410558976
Total megabyte-seconds taken by all reduce tasks=1005332480
Map-Reduce Framework
Map input records=388738323
Map output records=388738323
Map output bytes=4664859876
Map output materialized bytes=5442336912
Input split bytes=7670
Combine input records=0
Combine output records=0
Reduce input groups=371667599
Reduce shuffle bytes=5442336912
Reduce input records=388738323
Reduce output records=388738323
Spilled Records=891105922
Shuffled Maps =65
Failed Shuffles=0
Merged Map outputs=65
GC time elapsed (ms)=217041
CPU time spent (ms)=2437680
Physical memory (bytes) snapshot=50441203712
Virtual memory (bytes) snapshot=89002385408
Total committed heap usage (bytes)=44534071296

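The data shows a total time of about 17 min and (Spilled Records) / (Reduce output records) = 2.29. Increasing the mapper-side memory therefore does have some effect and clearly reduces the number of spilled records (the mappers may have been producing more data than the default 100 MB buffer can hold, although by the earlier estimate the output should take only about 67 MB). Total time spent by all map tasks is also somewhat lower than in the previous two runs.

Increasing the reducer-side merge memory

For this MR job, this should be the setting that helps the most. Since there is only one reducer, its memory can be made large, say 5 GB, so that it can hold most of the mapper output and the merge can be done in memory. The memory used for merging can be controlled both by a fraction and by an absolute threshold; only the fraction is used here. Because the fraction is relative to the JVM heap size, the heap settings and the fraction settings have to be changed together.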

conf.setInt("mapreduce.reduce.memory.mb", 5500); // JVM process & its sub processes
conf.set("mapreduce.reduce.java.opts", "-Xmx5000m"); // JVM max heap size
conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.9f); // using percentage
conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.9f); // merge / total
conf.setInt("mapred.inmem.merge.threshold", 0); // disable value threshold

Run output

15/08/07 14:29:54 INFO mapreduce.Job: Job job_1438916755596_0009 running in uber mode : false
15/08/07 14:29:54 INFO mapreduce.Job: map 0% reduce 0%
15/08/07 14:30:12 INFO mapreduce.Job: map 2% reduce 0%
15/08/07 14:30:13 INFO mapreduce.Job: map 4% reduce 0%
15/08/07 14:30:14 INFO mapreduce.Job: map 7% reduce 0%
15/08/07 14:30:15 INFO mapreduce.Job: map 11% reduce 0%
...
15/08/07 14:33:36 INFO mapreduce.Job: map 98% reduce 19%
15/08/07 14:33:42 INFO mapreduce.Job: map 99% reduce 19%
15/08/07 14:33:49 INFO mapreduce.Job: map 100% reduce 19%
...
15/08/07 14:35:31 INFO mapreduce.Job: map 100% reduce 33%
15/08/07 14:36:37 INFO mapreduce.Job: map 100% reduce 39%
...
15/08/07 14:44:23 INFO mapreduce.Job: map 100% reduce 99%
15/08/07 14:44:38 INFO mapreduce.Job: map 100% reduce 100%
15/08/07 14:44:45 INFO mapreduce.Job: Job job_1438916755596_0009 completed successfully
Job Counters
Killed map tasks=1
Launched map tasks=66
Launched reduce tasks=1
Data-local map tasks=63
Rack-local map tasks=3
Total time spent by all maps in occupied slots (ms)=3736173
Total time spent by all reduces in occupied slots (ms)=4700526
Total time spent by all map tasks (ms)=3736173
Total time spent by all reduce tasks (ms)=783421
Total vcore-seconds taken by all map tasks=3736173
Total vcore-seconds taken by all reduce tasks=783421
Total megabyte-seconds taken by all map tasks=3825841152
Total megabyte-seconds taken by all reduce tasks=4308815500
Map-Reduce Framework
Map input records=388738323
Map output records=388738323
Map output bytes=4664859876
Map output materialized bytes=5442336912
Input split bytes=7670
Combine input records=0
Combine output records=0
Reduce input groups=371667599
Reduce shuffle bytes=5442336912
Reduce input records=388738323
Reduce output records=388738323
Spilled Records=1165028413
Shuffled Maps =65
Failed Shuffles=0
Merged Map outputs=65
GC time elapsed (ms)=51656
CPU time spent (ms)=2575510
Physical memory (bytes) snapshot=21428072448
Virtual memory (bytes) snapshot=51386806272
Total committed heap usage (bytes)=16239820800

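The job took about 15 min in total, second only to the 14 min obtained by adjusting io.sort.factor. However, (Spilled Records) / (Reduce output records) = 2.99, which is very close to the io.sort.factor case. Something is not right here, so let us look at the reducer-side log: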

2015-08-08 04:28:55,296 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2015-08-08 04:28:55,360 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2015-08-08 04:28:55,360 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system started
2015-08-08 04:28:55,371 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2015-08-08 04:28:55,371 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1439006457627_0002, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@411e6b7e)
2015-08-08 04:28:55,463 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2015-08-08 04:28:55,902 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /tmp/hadoop/nm-local-dir/usercache/ubuntu/appcache/application_1439006457627_0002
2015-08-08 04:28:56,675 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2015-08-08 04:28:57,117 INFO [main] org.apache.hadoop.mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2015-08-08 04:28:57,160 INFO [main] org.apache.hadoop.mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@38ed6f20
2015-08-08 04:28:57,178 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: MergerManager: memoryLimit=1932735232, maxSingleShuffleLimit=483183808, mergeThreshold=1739461632, ioSortFactor=10, memToMemMergeOutputsThreshold=10

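The last line shows some of the parameters that the MergeManagerImpl class ended up with:

memoryLimit=1932735232, about 1.9 GB

maxSingleShuffleLimit=483183808, about 480 MB

mergeThreshold=1739461632, about 1.7 GB

ioSortFactor=10, the default value

memToMemMergeOutputsThreshold=10, ignored for now

memoryLimit * 0.9 is about 1.7 GB, which matches mergeThreshold, so mapreduce.reduce.shuffle.merge.percent did take effect (it is the fraction of the input buffer that must be in use before merging and writing out starts). But the program clearly gave the reducer task 5 GB of memory, so why is the limit only 1.9 GB? Did the mapreduce.reduce.java.opts parameter fail to take effect? After rerunning the job, running ps on the machine executing the reducer task showed that the JVM start-up arguments did include the memory settings. Since the logged values are puzzling, the next step is to read the MergeManagerImpl class and see how it computes them.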
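Before reading the source, the logged limit can be worked backwards to see which base size it implies. The sketch below assumes that the limit is simply a base size multiplied by the configured 0.9 in single-precision arithmetic; that is a guess about the implementation, and the class and method names are hypothetical.

```java
// Works backwards from the logged memoryLimit to the base it was derived from.
final class MemoryLimitCheck {
    /** Base size that, multiplied by `percent`, would give `memoryLimit`. */
    static long impliedBaseBytes(long memoryLimit, float percent) {
        return Math.round(memoryLimit / (double) percent);
    }

    public static void main(String[] args) {
        long memoryLimit = 1932735232L;            // from the MergeManagerImpl log line
        long requestedHeap = 5000L * 1024 * 1024;  // -Xmx5000m
        long base = impliedBaseBytes(memoryLimit, 0.9f);

        System.out.println("implied base      = " + base);
        System.out.println("requested heap    = " + requestedHeap);
        System.out.println("Integer.MAX_VALUE = " + Integer.MAX_VALUE);
    }
}
```

Under these assumptions the implied base lands within one byte of Integer.MAX_VALUE and nowhere near the requested 5000 MB heap, which is consistent with the 2 GB cap described in the parameter section.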

http://www.cnblogs.com/lailailai/p/4713105.html
