MongoDB mapReduce

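I took a quick look at mapReduce and tried to build a group-by with it without reading the API in detail. I ran into quite a few problems, which I list here so that anyone who hits the same errors can find them by searching.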

// First, look at the data in the person collection
> db.person.find();
{ "_id" : ObjectId("593011c8a92497992cdfac10"), "name" : "xhj", "age" : 30, "address" : DBRef("address", ObjectId("59314b07e693aae7a5eb72ab")) }
{ "_id" : ObjectId("59301270a92497992cdfac11"), "name" : "zzj", "age" : 2 }
{ "_id" : ObjectId("593015fda92497992cdfac12"), "name" : "my second child", "age" : "i do not know" }
{ "_id" : ObjectId("592ffd872108e8e79ea902b0"), "name" : "zjf", "age" : 30, "address" : { "province" : "河南省", "city" : "南阳市", "building" : "桐柏县" } }
// Do a group by with the aggregation framework
> db.person.aggregate({$group : {_id: '$age', count : {$sum : 1}}})
{ "_id" : "i do not know", "count" : 1 }
{ "_id" : 2, "count" : 1 }
{ "_id" : 30, "count" : 2 }
// Now try to get the same group-by effect with map-reduce
// The logic is simple: define a map function and a reduce function
> var m = function(){ emit(this.age,1) };
> var r = function(key,values){
... var sum = 0;
... values.forEach(function(val){
... sum += val;
... });
... return sum;
... }
// Then run mapReduce on person. This fails: an optionsOrOutString argument is required
> db.person.mapReduce( m, r ).find();
assert failed : need to supply an optionsOrOutString
Error: assert failed : need to supply an optionsOrOutString
at Error (<anonymous>)
at doassert (src/mongo/shell/assert.js:11:14)
at assert (src/mongo/shell/assert.js:20:5)
at DBCollection.mapReduce (src/mongo/shell/collection.js:1343:5)
at (shell):1:11
2017-06-03T12:42:06.704+0800 E QUERY Error: assert failed : need to supply an optionsOrOutString
at Error (<anonymous>)
at doassert (src/mongo/shell/assert.js:11:14)
at assert (src/mongo/shell/assert.js:20:5)
at DBCollection.mapReduce (src/mongo/shell/collection.js:1343:5)
at (shell):1:11 at src/mongo/shell/assert.js:13
// With an empty options object the error changes: 'out' has to be a string or an object
> db.person.mapReduce( m, r,{} ).find();
2017-06-03T12:42:24.726+0800 E QUERY Error: map reduce failed:{
"errmsg" : "exception: 'out' has to be a string or an object",
"code" : 13606,
"ok" : 0
}
at Error (<anonymous>)
at DBCollection.mapReduce (src/mongo/shell/collection.js:1353:15)
at (shell):1:11 at src/mongo/shell/collection.js:1353
// Passing a declared but unassigned variable does not work either (outstr is undefined)
> var outstr;
> db.person.mapReduce( m, r,{out:outstr} ).find();
2017-06-03T12:42:45.502+0800 E QUERY Error: map reduce failed:{
"errmsg" : "exception: 'out' has to be a string or an object",
"code" : 13606,
"ok" : 0
}
at Error (<anonymous>)
at DBCollection.mapReduce (src/mongo/shell/collection.js:1353:15)
at (shell):1:11 at src/mongo/shell/collection.js:1353
// I then learned that out expects a collection, so I passed the string 'outt' as the name of the collection that stores the results
> db.person.mapReduce( m, r,{out:'outt'} ).find();
{ "_id" : 2, "value" : 1 }
{ "_id" : 30, "value" : 2 }
{ "_id" : "i do not know", "value" : 1 }
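// The results are now also stored in outt. What I don't understand: shouldn't it be possible to call find() directly without specifying out? Why the extra step?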
> db.outt.find();
{ "_id" : 2, "value" : 1 }
{ "_id" : 30, "value" : 2 }
{ "_id" : "i do not know", "value" : 1 }

Because I ran into so many problems, I read the MongoDB documentation (https://docs.mongodb.com/manual/reference/method/db.collection.mapReduce/), translated and organized it, and summarized it as follows:

Command form:

db.runCommand(
    {
        mapReduce: <collection>,
        map: <function>,
        reduce: <function>,
        finalize: <function>,
        out: <output>,
        query: <document>,
        sort: <document>,
        limit: <number>,
        scope: <document>,
        jsMode: <boolean>,
        verbose: <boolean>,
        bypassDocumentValidation: <boolean>,
        collation: <document>
    }
)

Simple form:

db.collection.mapReduce(
    <map>,
    <reduce>,
    {
        out: <collection>,
        query: <document>,
        sort: <document>,
        limit: <number>,
        finalize: <function>,
        scope: <document>,
        jsMode: <boolean>,
        verbose: <boolean>,
        bypassDocumentValidation: <boolean>
    }
)

db.collection.mapReduce() is somewhat more convenient to use. Its parameters are:

Parameter	Type	Description
map	function	A JavaScript function that associates or "maps" a value with a key and emits the key and value pair.
reduce	function	A JavaScript function that "reduces" to a single object all the values associated with a particular key.
options	document	A document that specifies additional parameters to db.collection.mapReduce().
bypassDocumentValidation	boolean	Optional. Enables mapReduce to bypass document validation during the operation. This lets you insert documents that do not meet the validation requirements.

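In detail:

The map function:

The map function is responsible for transforming each input document into zero or more documents. It can access the variables defined in the scope parameter and has the following prototype: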

function() {
    ...
    emit(key, value);
}

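Inside the map function, use this to refer to the current document.
The map function should be pure: it should not access the database for any reason, and it should have no effect outside the function.
The map function may optionally call emit(key, value) any number of times to create an output document that associates key with value.

The following map function calls emit zero or one times, depending on the document's status field: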

function() {
    if (this.status == 'A')
        emit(this.cust_id, 1);
}

It can also call emit multiple times:

function() {
    this.items.forEach(function(item){ emit(item.sku, 1); });
}
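
To make "zero or more emit calls" concrete, here is a minimal sketch in plain JavaScript. It is not how MongoDB implements the map phase; the helper runMap and the sample orders are hypothetical, made up for illustration. The helper binds each document to this and records every emit call per key:

```javascript
// Hypothetical helper (an assumption, not a MongoDB API): applies a map
// function to every document and collects the emitted values per key.
function runMap(docs, mapFn) {
  const groups = new Map(); // key -> array of emitted values
  // Expose emit as a global, the way the map function expects to find it.
  globalThis.emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  try {
    for (const doc of docs) mapFn.call(doc); // "this" is the current document
  } finally {
    delete globalThis.emit;
  }
  return groups;
}

// Made-up sample documents shaped like the two examples above.
const orders = [
  { cust_id: "c1", status: "A", items: [{ sku: "apple" }, { sku: "pear" }] },
  { cust_id: "c2", status: "B", items: [] },
  { cust_id: "c1", status: "A", items: [{ sku: "apple" }] },
];

// Emits zero or one times per document: c2 has status "B" and emits nothing.
const byCustomer = runMap(orders, function () {
  if (this.status == "A") emit(this.cust_id, 1);
});

// Emits once per item: zero times for the document with an empty items array.
const bySku = runMap(orders, function () {
  this.items.forEach(function (item) { emit(item.sku, 1); });
});

console.log(byCustomer);
console.log(bySku);
```

A document that never calls emit simply contributes nothing to the output, which is why c2 has no entry in byCustomer.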

The reduce function:

The reduce function has the following structure:

function(key, values) {
    ...
    return result;
}

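The reduce function should be pure: it should not access the database for any reason, and it should have no effect outside the function.
The values argument is an array whose elements are the value objects "mapped" to the key. MongoDB does not call the reduce function for a key that has only a single value.
MongoDB can invoke the reduce function more than once for the same key. In that case, the previous output of the reduce function for that key becomes one of the input values to the next reduce invocation for that key.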
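
Because a previous reduce output can come back as an input value, a reduce function has to give the same answer whether it sees all values at once or in two steps. The following is a small sketch of that check in plain JavaScript; the helper reReduceSafe is hypothetical and only tests one particular split, so a true result is a hint rather than a proof:

```javascript
// Sums the values: a partial sum fed back in is still just a number to add.
const sumReduce = function (key, values) {
  let sum = 0;
  values.forEach(function (val) { sum += val; });
  return sum;
};

// Counts array elements: a partial count fed back in is counted as ONE element.
const countReduce = function (key, values) {
  let sum = 0;
  values.forEach(function () { sum += 1; });
  return sum;
};

// Hypothetical helper: compares a single reduce call over all values with a
// two-step reduce in which the first output becomes an input of the second.
function reReduceSafe(reduceFn, key, values) {
  const oneShot = reduceFn(key, values);
  const partial = reduceFn(key, values.slice(0, 2));
  const twoStep = reduceFn(key, [partial].concat(values.slice(2)));
  return oneShot === twoStep;
}

console.log(reReduceSafe(sumReduce, 30, [1, 1, 1]));         // true
console.log(reReduceSafe(countReduce, 30, [100, 100, 100])); // false
```

This is one reason to count by emitting 1 and summing, rather than by counting the elements of values.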

The following experiment demonstrates that MongoDB does not call the reduce function for a key that has only a single value:

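// The emitted value is now 100 instead of 1
var m = function(){ emit(this.age,100) };
// Loop over values and add 1 per element to get the count
var r = function(key,values){
    var sum = 0;
    values.forEach(function(val){
        sum += 1;
    });
    return sum;
}
// In the result, every key with only one value outputs 100, the emitted value itself, because reduce was never called for it. Not the result we wanted.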
> db.person.mapReduce(m,r,{out:'outt'}).find();
{ "_id" : 2, "value" : 100 }
{ "_id" : 30, "value" : 2 }
{ "_id" : "i do not know", "value" : 100 }
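
The output above can be reproduced with a short plain-JavaScript sketch of the rule. This models only the "single value bypasses reduce" behavior, not MongoDB's implementation; the helper reduceAll is hypothetical. It also shows one possible fix, which is to emit 1 per document and sum the values, as in the first example of this post:

```javascript
// Hypothetical helper: a key with exactly one value skips reduce, and the
// emitted value is passed through unchanged.
function reduceAll(groups, reduceFn) {
  const out = new Map();
  for (const [key, values] of groups) {
    out.set(key, values.length === 1 ? values[0] : reduceFn(key, values));
  }
  return out;
}

// Values emitted per age by emit(this.age, 100) for the four person documents.
const emitted100 = new Map([
  [2, [100]],
  [30, [100, 100]],
  ["i do not know", [100]],
]);
// Equivalent to the "sum += 1" loop in the experiment.
const countReduce = function (key, values) { return values.length; };
const broken = reduceAll(emitted100, countReduce);
console.log(broken.get(2), broken.get(30), broken.get("i do not know")); // 100 2 100

// One possible fix: emit 1 and sum, so a bypassed single value is already
// the correct count.
const emitted1 = new Map([
  [2, [1]],
  [30, [1, 1]],
  ["i do not know", [1]],
]);
const sumReduce = function (key, values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
};
const fixed = reduceAll(emitted1, sumReduce);
console.log(fixed.get(2), fixed.get(30), fixed.get("i do not know")); // 1 2 1
```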

The options parameter:

Field	Type	Description
out	string or document	Specifies the location of the result of the map-reduce operation. You can output to a collection, output to a collection with an action, or output inline. You may output to a collection when performing map-reduce operations on the primary members of the set; on secondary members you may only use the inline output.
query	document	Specifies the selection criteria using query operators for determining the documents input to the map function.
sort	document	Sorts the input documents. This option is useful for optimization. For example, specify the sort key to be the same as the emit key so that there are fewer reduce operations. The sort key must be in an existing index for this collection. For more on sort as a performance optimization, see http://www.csdn.net/article/2013-07-08/2816155-MongoDB-MapReduce-Optimization
limit	number	Specifies a maximum number of documents for the input into the map function.
finalize	function	Optional. Follows the reduce method and modifies the output.
scope	document	Specifies global variables that are accessible in the map, reduce and finalize functions.
jsMode	boolean	Specifies whether to convert intermediate data into BSON format between the execution of the map and reduce functions. Defaults to false. If false: internally, MongoDB converts the JavaScript objects emitted by the map function to BSON objects. These BSON objects are then converted back to JavaScript objects when calling the reduce function. The map-reduce operation places the intermediate BSON objects in temporary, on-disk storage, which allows it to execute over arbitrarily large data sets. If true: internally, the JavaScript objects emitted during the map function remain as JavaScript objects. There is no need to convert the objects for the reduce function, which can result in faster execution. You can only use jsMode for result sets with fewer than 500,000 distinct key arguments to the mapper's emit() function.
verbose	boolean	Specifies whether to include the timing information in the result information. The verbose defaults to true to include the timing information.
collation	document	Optional. Specifies the collation to use for the operation. Collation allows users to specify language-specific rules for string comparison, such as rules for lettercase and accent marks. The collation option has the following syntax: collation: { locale: <string>, caseLevel: <boolean>, caseFirst: <string>, strength: <int>, numericOrdering: <boolean>, alternate: <string>, maxVariable: <string>, backwards: <boolean> } When specifying collation, the locale field is mandatory; all other collation fields are optional. For descriptions of the fields, see Collation Document. If the collation is unspecified but the collection has a default collation (see db.createCollection()), the operation uses the collation specified for the collection. If no collation is specified for the collection or for the operations, MongoDB uses the simple binary comparison used in prior versions for string comparisons. You cannot specify multiple collations for an operation. For example, you cannot specify different collations per field, or if performing a find with a sort, you cannot use one collation for the find and another for the sort. New in version 3.4.
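
As an illustration of how several of these fields fit together, here is a sketch of an options document for the age count on the person collection. The specific query, the limit value, the scope variable prefix, and the finalize function are my own assumptions for the example, not something taken from the documentation. The snippet only builds the objects; the actual shell call is shown in a comment because it needs a running MongoDB:

```javascript
// Map and reduce for counting documents per age. emit is supplied by
// MongoDB when the function runs on the server.
const m = function () { emit(this.age, 1); };
const r = function (key, values) {
  let sum = 0;
  values.forEach(function (val) { sum += val; });
  return sum;
};

// finalize receives the key and the reduced value. "prefix" is not defined
// here: it is expected to come from the scope document below.
const f = function (key, reducedValue) {
  return { label: prefix + key, count: reducedValue };
};

const options = {
  out: { inline: 1 },          // return the results instead of storing them
  query: { age: { $gte: 0 } }, // only numeric, non-negative ages reach map
  sort: { age: 1 },            // same key as emit; needs an index on age
  limit: 100,                  // at most 100 input documents (made-up value)
  finalize: f,
  scope: { prefix: "age-" },   // hypothetical global visible in m, r and f
};

// In the mongo shell this would be run as:
//   db.person.mapReduce(m, r, options)
console.log(Object.keys(options).join(", "));
```

Since sort requires an existing index on the sort key, an index on age would have to be created on person before this call.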

In detail:

The out parameter:

Output to a new collection:

out: <collectionName>

Output to an existing collection:

out: { <action>: <collectionName>
[, db: <dbName>]
[, sharded: <boolean> ]
[, nonAtomic: <boolean> ] }

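<action> can be one of:

replace: if a collection named <collectionName> exists, replace its contents.
merge: if the output collection already exists, merge the new result with the existing result. If an existing document has the same key as a new result, overwrite the existing document.
reduce: if the output collection already exists, merge the new result with the existing result. If an existing document has the same key as a new result, apply the reduce function to both the new and the existing document and overwrite the existing document with the result.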
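
The difference between the three actions is easiest to see side by side. The following plain-JavaScript sketch models the output collection as a Map from _id to value. The helper applyOut and the sample numbers are hypothetical, and the order in which the existing and new values are passed to reduce is my assumption:

```javascript
// Hypothetical model of the three out actions. "existing" is the current
// content of the output collection, "results" the new map-reduce results.
function applyOut(action, existing, results, reduceFn) {
  if (!["replace", "merge", "reduce"].includes(action)) {
    throw new Error("unknown action: " + action);
  }
  if (action === "replace") return new Map(results); // old content is dropped
  const out = new Map(existing);
  for (const [key, value] of results) {
    if (action === "reduce" && out.has(key)) {
      // Same key on both sides: combine old and new with the reduce function.
      out.set(key, reduceFn(key, [out.get(key), value]));
    } else {
      // merge, or a key that did not exist yet: the new result wins.
      out.set(key, value);
    }
  }
  return out;
}

const sumReduce = function (key, values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
};
const existing = new Map([[2, 1], [30, 2], [40, 5]]);
const results = new Map([[30, 3], [50, 1]]);

const replaced = applyOut("replace", existing, results, sumReduce);
const merged = applyOut("merge", existing, results, sumReduce);
const reduced = applyOut("reduce", existing, results, sumReduce);

console.log(replaced.get(30), replaced.has(2)); // 3 false
console.log(merged.get(30), merged.get(2));     // 3 1
console.log(reduced.get(30), reduced.get(2));   // 5 1
```

With reduce, the key 30 ends up as 2 + 3 = 5, which is what makes this action suitable for incremental runs over new input only.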

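db, sharded, and nonAtomic are all optional.

db: optional. The name of the database to write the output to. It defaults to the database of the input collection.

sharded: optional. If true and sharding is enabled on the output database, the map-reduce operation shards the output collection using the _id field as the shard key.

nonAtomic:

Optional. Specifies the output operation as non-atomic. This applies only to the merge and reduce output modes, which may take minutes to execute.

By default, nonAtomic is false, and the map-reduce operation locks the database during post-processing.

If nonAtomic is true, the post-processing step does not lock the database: during this time, other clients can read intermediate states of the output collection.

Inline output:

Performs the map-reduce operation in memory and returns the result.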

out: { inline: 1 }

Result:

> db.person.mapReduce(m,r,{out:{inline:1}})
{
"results" : [
{
"_id" : 2,
"value" : 100
},
{
"_id" : 30,
"value" : 2
},
{
"_id" : "i do not know",
"value" : 100
}
],
"timeMillis" : 0,
"counts" : {
"input" : 4,
"emit" : 4,
"reduce" : 1,
"output" : 3
},
"ok" : 1
}
> db.person.mapReduce(m,r,{out:{inline:1}}).find()
[
{
"_id" : 2,
"value" : 100
},
{
"_id" : 30,
"value" : 2
},
{
"_id" : "i do not know",
"value" : 100
}
]