Mongoid Paging and Iterating Over Large Collections
When iterating over every record in a database, the first thing that comes to mind is Model.all.each. But when the data set is large (tens of thousands of records or more), this is a poor fit: Model.all.each loads all records at once and instantiates each one as a Model object, which drives up memory usage and can exhaust memory entirely.
ActiveRecord addresses exactly this problem with find_each, which is built on find_in_batches and loads records in batches, 1000 per batch by default.
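A quick usage sketch (User is a hypothetical model; find_each yields one record at a time while fetching in batches under the hood):

User.find_each(batch_size: 500) do |user|
  # At most 500 User objects are instantiated per underlying query,
  # so memory stays flat no matter how large the table is.
end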
With Mongoid, you can in fact use Person.all.each directly: it iterates over a server-side cursor, which fetches documents in batches for you. There is one caveat to keep in mind, though: a cursor times out after 10 minutes of inactivity. If the full iteration runs longer than that, you risk hitting a "no cursor" error partway through.
# gems/mongo-2.2.4/lib/mongo/collection.rb:218
# @option options [ true, false ] :no_cursor_timeout The server normally times out idle cursors
# after an inactivity period (10 minutes) to prevent excess memory use. Set this option to prevent that.
You can work around the timeout with Model.all.no_timeout.each, but that is not recommended. Note also that the default batch_size may not suit your workload; you can set it explicitly with Model.all.no_timeout.batch_size(500).each. MongoDB's default batch sizing is fairly involved, though, so tune it with care. From the MongoDB docs:

The MongoDB server returns the query results in batches. Batch size will not exceed the maximum BSON document size. For most queries, the first batch returns 101 documents or just enough documents to exceed 1 megabyte. Subsequent batch size is 4 megabytes. To override the default size of the batch, see batchSize() and limit().

For queries that include a sort operation without an index, the server must load all the documents in memory to perform the sort before returning any results.

As you iterate through the cursor and reach the end of the returned batch, if there are more results, cursor.next() will perform a getmore operation to retrieve the next batch.
For example, Model.all.each { print '.' } issues one initial query followed by a getMore operation for each subsequent batch.
Another seemingly similar approach is to page with skip and limit, along the lines of Model.all.skip(m).limit(n). Unfortunately, this does not hold up on large collections, because queries get slower as the skip value grows. Again from the MongoDB docs: "The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As the offset (e.g. pageNumber above) increases, cursor.skip() will become slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound." This reminds me of a post I once read about will_paginate with a very deep page count (around 10,000 pages): clicking the last few pages was noticeably slow, essentially because the pagination was built on offset, and the larger the offset, the slower the query.
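For illustration, here is the naive skip/limit paging loop that the quote warns about (a sketch only; page_size and the process step are placeholders):

# Anti-pattern: for each page the server must walk past every document
# skipped so far, so later pages get progressively slower.
page_size = 1000
page = 0
loop do
  batch = Model.all.skip(page * page_size).limit(page_size).to_a
  break if batch.empty?
  batch.each { |doc| process(doc) } # process is a stand-in for real work
  page += 1
end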
Let's come back to ActiveRecord's find_each. Rails thought this through: under the hood it avoids offset entirely, instead taking the id of the last record in each batch as the primary_key_offset for the next batch's query:
# gems/activerecord-4.2.5.1/lib/active_record/relation/batches.rb:98
def find_in_batches(options = {})
  options.assert_valid_keys(:start, :batch_size)

  relation = self
  start = options[:start]
  batch_size = options[:batch_size] || 1000

  unless block_given?
    return to_enum(:find_in_batches, options) do
      total = start ? where(table[primary_key].gteq(start)).size : size
      (total - 1).div(batch_size) + 1
    end
  end

  if logger && (arel.orders.present? || arel.taken.present?)
    logger.warn("Scoped order and limit are ignored, it's forced to be batch order and batch size")
  end

  relation = relation.reorder(batch_order).limit(batch_size)
  records = start ? relation.where(table[primary_key].gteq(start)).to_a : relation.to_a

  while records.any?
    records_size = records.size
    primary_key_offset = records.last.id
    raise "Primary key not included in the custom select clause" unless primary_key_offset

    yield records

    break if records_size < batch_size
    records = relation.where(table[primary_key].gt(primary_key_offset)).to_a
  end
end
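A usage sketch of the above (User is a hypothetical model; the :start option resumes batching from a given primary key):

# Each underlying query is of the form
# WHERE id > <last id of previous batch> ORDER BY id LIMIT 1000.
User.find_in_batches(start: 20_000, batch_size: 1000) do |users|
  users.each { |user| user.touch }
end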
The earlier Model.all.no_timeout.batch_size(1000).each batches on the server side; we can mimic it with client-side batching, that is, a Mongoid version of find_each:
# /config/initializers/mongoid_batches.rb
module Mongoid
  module Batches
    def find_each(batch_size = 1000)
      return to_enum(:find_each, batch_size) unless block_given?

      find_in_batches(batch_size) do |documents|
        documents.each { |document| yield document }
      end
    end

    # Pages by _id rather than skip/limit: each batch is a fresh query
    # for documents whose _id is greater than the last one already seen.
    def find_in_batches(batch_size = 1000)
      return to_enum(:find_in_batches, batch_size) unless block_given?

      documents = self.limit(batch_size).asc(:id).to_a
      while documents.any?
        documents_size = documents.size
        primary_key_offset = documents.last.id

        yield documents

        break if documents_size < batch_size
        documents = self.gt(id: primary_key_offset).limit(batch_size).asc(:id).to_a
      end
    end
  end
end

Mongoid::Criteria.include Mongoid::Batches
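Since the module is mixed into Mongoid::Criteria, the new methods are available on any criteria, e.g. Model.all or a where chain (depending on your Mongoid version, a bare Model.find_each may not be delegated from the model class). A usage sketch:

# Every batch is a brand-new query, so the 10-minute cursor timeout
# no longer applies to long-running iterations.
Model.all.find_each do |document|
  # per-document work goes here
end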
Finally, when the per-record work is expensive, you can also bring in parallel processing, along these lines:

Model.find_each { ... }

Model.find_in_batches do |items|
  Parallel.each items, in_processes: 4 do |item|
    # ...
  end
end
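If the per-item work is I/O-bound rather than CPU-bound, the parallel gem can also run the same loop with threads instead of forked processes (in_threads is part of the same gem; the thread count here is an arbitrary choice):

Model.all.find_in_batches do |items|
  Parallel.each(items, in_threads: 8) do |item|
    # I/O-bound work, e.g. an HTTP call per document
  end
end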
References
- https://docs.mongodb.com/manual/core/cursors/
- https://docs.mongodb.com/master/reference/method/cursor.skip/
- https://ruby-china.org/topics/28659