- map: (K1, V1) → list(K2, V2)
- reduce: (K2, list(V2)) → list(K3, V3)
The (simplified) source code of Mapper and Reducer shows why the types take this form:
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  public class Context extends MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // ...
  }

  protected void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    // ...
  }
}
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  public class Context extends ReducerContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // ...
  }

  protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
      throws IOException, InterruptedException {
    // ...
  }
}
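As an illustrative sketch (not taken from the Hadoop source), here is a reducer with concrete types, where K2 = K3 = Text and V2 = V3 = IntWritable, that sums the values for each key; the class name IntSumReducer and the summing logic are just placeholders:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative only: reduce takes (K2, list(V2)) and emits list(K3, V3).
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();           // fold the list(V2) for this key
    }
    result.set(sum);
    context.write(key, result);     // emit a single (K3, V3) pair per key
  }
}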
The Context object is used to emit the output key-value pairs; the method for writing them has this signature:
public void write(KEYOUT key, VALUEOUT value)
    throws IOException, InterruptedException
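For example, a hypothetical mapper (the class name and tokenizing logic are purely illustrative) with K1 = LongWritable, V1 = Text, K2 = Text, and V2 = IntWritable calls context.write() once per output pair:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: map takes (K1, V1) and emits list(K2, V2).
public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);   // each call emits one (K2, V2) pair
      }
    }
  }
}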
If a combiner is used (here the combiner is simply the same function as the reducer), the types become:
map: (K1, V1) → list(K2, V2)
combiner: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
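A hedged sketch of wiring this up, reusing the illustrative IntSumReducer above: because its output types equal its input types (K2, V2), the same class can be registered as the combiner.

import org.apache.hadoop.mapreduce.Job;

// Sketch only: a combiner must map (K2, list(V2)) to list(K2, V2), so its
// output types have to match its input types.
public class CombinerWiringSketch {
  static void configure(Job job) {
    job.setReducerClass(IntSumReducer.class);
    job.setCombinerClass(IntSumReducer.class);
  }
}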
If a partitioner is used:
partition: (K2, V2) → integer (in many cases the partition depends only on the key, and the value is ignored)
Both the combiner and the partitioner operate on the intermediate key, bringing records with the same key together.
public abstract class Partitioner<KEY, VALUE> {

  public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
One concrete implementation is HashPartitioner:
public class HashPartitioner<K, V> extends Partitioner<K, V> {

  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
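As a hedged illustration of the same signature (this class is not part of Hadoop), a partitioner that looks only at the key and ignores the value entirely might be written as:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical example: buckets records by the first character of the key;
// the value plays no part in the decision.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;                     // route empty keys to the first partition
    }
    // charAt() returns the Unicode code point at the given byte offset.
    return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
  }
}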
- The types of the input data are set by the input format. For TextInputFormat, for example, the key type is LongWritable and the value type is Text. The other types are set explicitly by calling methods on the Job (JobConf in the old API). If they are not set explicitly, the intermediate types default to the (final) output types, which themselves default to LongWritable and Text. So if K2 and K3 are the same type, there is no need to call setMapOutputKeyClass(), since it falls back to the type set for the output. Every step's input and output types are configured this way, which may seem odd: why can't the types of each step be inferred from the types of the original input?
The reason is that Java's generics have a number of limitations: type erasure means the type information is not always available at runtime, so Hadoop has to be "reminded" of it from time to time. This also means a MapReduce job can be configured with incompatible input and output types, because the configuration cannot be checked at compile time. The settings that have to be compatible with the MapReduce types are listed in the tables below. All type mismatches are only discovered when the job actually runs, so a sensible practice is to run a test job on a small amount of data first to flush out any type incompatibilities.
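A minimal sketch of setting these types explicitly (the Writable classes chosen here are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class TypeConfigSketch {
  static Job configure(Configuration conf) throws IOException {
    Job job = Job.getInstance(conf);
    // Final output types (K3, V3):
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // V2 (LongWritable) differs from V3 (IntWritable), so it must be set;
    // K2 equals K3 (both Text), so setMapOutputKeyClass() could be omitted.
    job.setMapOutputValueClass(LongWritable.class);
    return job;
  }
}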
Table 8-1. Configuration of MapReduce types in the new API

Properties for configuring types:

| Property | Job setter method | Types set |
| --- | --- | --- |
| mapreduce.job.inputformat.class | setInputFormatClass() | K1, V1 |
| mapreduce.map.output.key.class | setMapOutputKeyClass() | K2 |
| mapreduce.map.output.value.class | setMapOutputValueClass() | V2 |
| mapreduce.job.output.key.class | setOutputKeyClass() | K3 |
| mapreduce.job.output.value.class | setOutputValueClass() | V3 |

Properties that must be consistent with the types:

| Property | Job setter method | Types checked |
| --- | --- | --- |
| mapreduce.job.map.class | setMapperClass() | K1, V1, K2, V2 |
| mapreduce.job.combine.class | setCombinerClass() | K2, V2 |
| mapreduce.job.partitioner.class | setPartitionerClass() | K2, V2 |
| mapreduce.job.output.key.comparator.class | setSortComparatorClass() | K2 |
| mapreduce.job.output.group.comparator.class | setGroupingComparatorClass() | K2 |
| mapreduce.job.reduce.class | setReducerClass() | K2, V2, K3, V3 |
| mapreduce.job.outputformat.class | setOutputFormatClass() | K3, V3 |
Table 8-2. Configuration of MapReduce types in the old API

Properties for configuring types:

| Property | JobConf setter method | Types set |
| --- | --- | --- |
| mapred.input.format.class | setInputFormat() | K1, V1 |
| mapred.mapoutput.key.class | setMapOutputKeyClass() | K2 |
| mapred.mapoutput.value.class | setMapOutputValueClass() | V2 |
| mapred.output.key.class | setOutputKeyClass() | K3 |
| mapred.output.value.class | setOutputValueClass() | V3 |

Properties that must be consistent with the types:

| Property | JobConf setter method | Types checked |
| --- | --- | --- |
| mapred.output.key.comparator.class | setOutputKeyComparatorClass() | K2 |
| mapred.output.value.groupfn.class | setOutputValueGroupingComparator() | K2 |
| mapred.output.format.class | setOutputFormat() | K3, V3 |
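For reference, a hedged sketch of the old-API calls on JobConf (org.apache.hadoop.mapred), again with illustrative type choices:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class OldApiTypeConfigSketch {
  static JobConf configure() {
    JobConf conf = new JobConf(OldApiTypeConfigSketch.class);
    conf.setInputFormat(TextInputFormat.class);       // mapred.input.format.class
    conf.setMapOutputKeyClass(Text.class);            // mapred.mapoutput.key.class
    conf.setMapOutputValueClass(IntWritable.class);   // mapred.mapoutput.value.class
    conf.setOutputKeyClass(Text.class);               // mapred.output.key.class
    conf.setOutputValueClass(IntWritable.class);      // mapred.output.value.class
    conf.setOutputFormat(TextOutputFormat.class);     // mapred.output.format.class
    return conf;
  }
}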
The simplest possible MapReduce driver sets nothing except the input and output paths:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MinimalMapReduce extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }

    Job job = new Job(getConf());
    job.setJarByClass(getClass());
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MinimalMapReduce(), args);
    System.exit(exitCode);
  }
}
Run it on the 1901 and 1902 NCDC files:

% hadoop MinimalMapReduce "input/ncdc/all/190{1,2}.gz" output

Part of the output (each → stands for the tab character separating key and value):

0→0029029070999991901010106004+64333+023450FM-12+000599999V0202701N01591...
0→0035029070999991902010106004+64333+023450FM-12+000599999V0201401N01181...
135→0029029070999991901010113004+64333+023450FM-12+000599999V0202901N00821...
141→0035029070999991902010113004+64333+023450FM-12+000599999V0201401N01181...
270→0029029070999991901010120004+64333+023450FM-12+000599999V0209991C00001...
282→0035029070999991902010120004+64333+023450FM-12+000599999V0201401N01391...
The same job with all of the defaults spelled out explicitly:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MinimalMapReduceWithDefaults extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // JobBuilder is a helper from the book's example code (sketched below).
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }

    job.setInputFormatClass(TextInputFormat.class);

    job.setMapperClass(Mapper.class);

    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);

    job.setPartitionerClass(HashPartitioner.class);

    job.setNumReduceTasks(1);
    job.setReducerClass(Reducer.class);

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    job.setOutputFormatClass(TextOutputFormat.class);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MinimalMapReduceWithDefaults(), args);
    System.exit(exitCode);
  }
}
The default mapper is Mapper itself, which writes the input key and value through to the output unchanged:

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  protected void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}
The default partitioner is HashPartitioner, which hashes the record's key to choose a partition; since there is only one reducer by default, there is only a single partition here:

public class HashPartitioner<K, V> extends Partitioner<K, V> {

  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
The default reducer is Reducer, again a generic type, which simply writes every value through to the output:

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
      throws IOException, InterruptedException {
    for (VALUEIN value : values) {
      context.write((KEYOUT) key, (VALUEOUT) value);
    }
  }
}
Because nothing else is done (the map simply passes along the offset and the value, partitioning is by hash, and a single reducer writes everything out), the output is exactly what we saw above.
Just like the Java API, Hadoop Streaming also has a minimal MapReduce job:
% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input input/ncdc/sample.txt \
  -output output \
  -mapper /bin/cat

With the defaults spelled out, the equivalent command is:

% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input input/ncdc/sample.txt \
  -output output \
  -inputformat org.apache.hadoop.mapred.TextInputFormat \
  -mapper /bin/cat \
  -partitioner org.apache.hadoop.mapred.lib.HashPartitioner \
  -numReduceTasks 1 \
  -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
  -outputformat org.apache.hadoop.mapred.TextOutputFormat
A Streaming application can control the separator that is used to split a line of input into a key and a value before it is passed to the mapper. By default the separator is a tab character, but if the key or the value itself contains tabs it is wise to change it to something else.
Similarly, a configurable separator is used when the map and reduce outputs are written. Furthermore, the key need not be only the first field of a record: it can be the first n fields (set with stream.num.map.output.key.fields and stream.num.reduce.output.key.fields), with the remaining fields making up the value. For example, for a comma-separated record a,b,c with n set to 2, the key is a,b and the value is c. The separator properties are summarized below.
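These are ordinary job configuration properties, so they can be passed with -D on the streaming command line or set programmatically; a hedged Java sketch with illustrative values:

import org.apache.hadoop.conf.Configuration;

public class StreamingSeparatorSketch {
  static Configuration configure() {
    Configuration conf = new Configuration();
    // Use a comma instead of the default tab between the map output key and value.
    conf.set("stream.map.output.field.separator", ",");
    // Treat the first two comma-separated fields as the map output key,
    // so the record a,b,c yields key "a,b" and value "c".
    conf.setInt("stream.num.map.output.key.fields", 2);
    return conf;
  }
}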
Table 8-3. Streaming separator properties

| Property | Description |
| --- | --- |
| stream.map.input.field.separator | The separator to use when passing the input key and value strings to the stream map process as a stream of bytes |
| stream.map.output.field.separator | The separator to use when splitting the output from the stream map process into key and value strings for the map output |
| stream.num.map.output.key.fields | The number of fields separated by stream.map.output.field.separator to treat as the map output key |
| stream.reduce.input.field.separator | The separator to use when passing the input key and value strings to the stream reduce process as a stream of bytes |
| stream.reduce.output.field.separator | The separator to use when splitting the output from the stream reduce process into key and value strings for the final reduce output |
| stream.num.reduce.output.key.fields | The number of fields separated by stream.reduce.output.field.separator to treat as the reduce output key |
These separators are used between the standard input/output streams and the map and reduce processes in a Streaming job.